Next-gen sequencing data analyses
We use Hadoop to parallelize alignment runs. Our parallelization strategy splits the genome into multiple reference sequences (i.e., one reference per chromosome or unplaced contig), maps all the reads against each reference, then combines the results, removing reads that match too many references. We split the genome rather than the reads because some tools (e.g., MAQ) are optimized to work best with 7-8 million reads and become less efficient as the number of reads decreases.
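The combine step can be sketched as follows. This is a minimal illustration, not our actual Hadoop job: the function name, the `(read_id, position)` hit-tuple layout, and the `max_hits` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def combine_hits(per_reference_hits, max_hits=1):
    """Merge per-reference alignment results, dropping reads that
    align to more than max_hits locations across all references.
    per_reference_hits: one list of (read_id, position) per reference."""
    counts = defaultdict(int)
    for hits in per_reference_hits:
        for read_id, _pos in hits:
            counts[read_id] += 1
    kept = []
    for hits in per_reference_hits:
        for read_id, pos in hits:
            if counts[read_id] <= max_hits:
                kept.append((read_id, pos))
    return kept
```

In the real pipeline the per-reference lists arrive from separate mapper tasks, and the counting/filtering corresponds to the reduce phase keyed on read ID.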
Processing of SOLiD data
SOLiD reads are measured and recorded in color space, a two-base encoding that helps differentiate single-base changes from measurement errors (ref?). Few tools map SOLiD reads; most are provided through the SOLiD software community web site, generally as open source. They take some getting used to, as their command-line interfaces are non-standard and unintuitive (e.g., there is no obvious option to process all the sequences in the input file rather than just the first). SHRiMP is another tool that maps color-space reads and may be more usable (?).
Xutao developed a tool, Mapreads_output_to_Eland, to convert Mapreads output to Eland format.
Processing of Illumina/Solexa data
Various tools are available to map short reads from FASTQ format: MAQ, Eland (notable for speed), and SeqMap.
Processing of Helicos
Helicos reads come as FASTA, which MAQ does not support: quality scores must be supplied to convert the FASTA to FASTQ. As an alternative to MAQ for Helicos read mapping, Eland should work.
Processing of 454 reads
Either megablast or BLAT can be used to align 454 reads to the reference genome. BLAT should be preferred for speed when the sequenced material is expected to be very close to the available reference genome (e.g., human against human), while megablast should be preferred when the sequenced material may differ from the reference through a number of polymorphisms (e.g., one mouse strain against a reference from another strain).