Elementolab/Deep Sequencing Resources
From Icbwiki
Contents |
[edit]
Complete Genomics analysis
- assessing genome sequencing quality
- format description, including summary (p62) and MasterVar (p25)
- Complete Genomics Structural Variations
- look up variant in ESP
- http://www.1000genomes.org/faq/how-can-i-get-allele-frequency-my-variant
[edit]
Sequence aligners
- BWA
- Please visit our BWA_tutorial for more information.
- TopHat
- Please see our TopHat Tutorial for more information.
- BWA
[edit]
Sequence database
- SRA (Sequence Read Archive)
- SRA toolkit is available here
- Convert SRA files to FASTQ format tutorial
- Instructions for submitting sequence files to SRA/GEO http://www.ncbi.nlm.nih.gov/geo/info/seq.html
- SRA (Sequence Read Archive)
[edit]
Processing reads
- Download and decompress a TAR file (e.g. from Casava 1.8) on the fly
wget -O - http://www.some_server.edu/somepath/Sample_A1.tar | tar xvf -
- Extract paired-end reads from a Cassava tar file
tar xvf SAMPLE.tar *_R1_*.fastq.gz --to-stdout > SAMPLE_R1.fastq.gz 2> SAMPLE_R1_files.txt tar xvf SAMPLE.tar *_R2_*.fastq.gz --to-stdout > SAMPLE_R2.fastq.gz 2> SAMPLE_R2_files.txt
- Basic read quality statistics
FASTX at Cold Spring Harbor Laboratory
- Illumina CASAVA 1.8 changes
- FASTQ quality scores now have an offset of 33. I.e. ASCII= base quality + 33
- See details here
- To extract pass-filter reads from CASAVA 1.8 files
cd /path/to/project/sample
mkdir filtered
for fastq in *.fastq.gz ; do
zcat $fastq |
grep -A 3 '^@.* [^:]*:N:[^:]*:' |
grep -v "^--$" | gzip > filtered/${fastq}.gz
done
[edit]
Processing alignments
- Samtools
http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ
- Converting SAM to BAM (assuming SAM file has @ header):
samtools view -bS s_1_sequence.txt.sam > s_1_sequence.txt.bam samtools sort s_1_sequence.txt.bam s_1_sequence.txt.sorted samtools index s_1_sequence.txt.sorted.bam
- Getting the read length from a SAM file
len=`zcat ${samfile}|grep chr1|head -1|cut -f10|wc -c`
let len=($len-1)
echo "len= ${len}"
[edit]
NGS analyses
[edit]
Gene fusion detection
- TopHat-Fusion usage and tips:
- Download here
- A more detailed description of our lab pipeline can be found here
- Annotation files:
- The annotation files included in the package are hg19 based. To use hg18 based annotations, download refGene.txt and ensGene.txt from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/), coordinate sort and rename to something like refGene_sorted.txt. E.g.
sort -k3,3 -k5,5n refGene.txt > refGene_sorted.txt
[edit]
Detection of insertions and deletions (indels)
- Dindel is a program for calling small indels from next-gen sequencing data. It is currently handles only Illumina data and uses the aligned BAM files as input.
- Download the program from here
- Related: here
[edit]
CHIP-seq analysis
- ChIPseeqer is the lab's comprehensive ChIP-seq analysis framework: http://physiology.med.cornell.edu/faculty/elemento/lab/chipseq.shtml
Interval Comparison tool (available in ChIPseeqer package)
# compare -2kb+1kb promoters of RefSeq genes (with gene identifiers) ./CompareIntervals -intervals1 DATA/refGene_data_7June2009.txt.u2000_d1000 -hasid1 1 -intervals2 ChIPseeqer_targets.txt -showprofile 1
- k-mer counting using GenomeTools from Stefan Kurtz and colleagues:
First, download the GenomeTools from http://genometools.org/
Then use the following commands to compile and run the program:
make cairo=no make prefix=$HOME/usr cairo=no install gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg18/wg.fa -indexname reads # we had to use -parts 10 on della for hg19 gt tallymer mkindex -mersize 30 -minocc 1 -indexname tyr-reads -counts -pl -esa reads -scan gt tallymer search -output qseqnum qpos counts sequence -tyr tyr-reads -q hg18/wg.fa
The full manual for the tallymer software can be found at http://www.zbh.uni-hamburg.de/tallymer/tallymer.pdf
[edit]
Sequence alignment for longer sequences
- BLAT for read mapping
/Users/olivier/bin/blat -tileSize=6 -t=dna -q=dna ../refMrna.fa TEST/test_read.fa /dev/stdout
- LAST
Download
http://last.cbrc.jp/archive/
Documentation
http://last.cbrc.jp/doc/tag-seeds.txt -a is gap penalty -e score threshold match/mismatch penalties are +1/-1 by default
Need to add this to src/makefile in CC and G++ flags
-arch x86_64
Make the index
src/lastdb -v hg18db hg18/WG/wg.fa
Run alignment
src/lastal -a2 -e30 hg18db myreads.fa
Parallel alignment (split fasta file into 10 files, then align each one using LAST)
perl PBS_lastal.pl myreads.fa 10 myhg18db [0/1]
