Elementolab/Deep Sequencing Resources

From Icbwiki

Jump to: navigation, search


Complete Genomics analysis

Sequence aligners

Sequence database

Processing reads

  • Download and decompress a TAR file (e.g. from Casava 1.8) on the fly
wget -O - http://www.some_server.edu/somepath/Sample_A1.tar | tar xvf -
  • Extract paired-end reads from a Cassava tar file
tar  xvf SAMPLE.tar *_R1_*.fastq.gz --to-stdout > SAMPLE_R1.fastq.gz 2> SAMPLE_R1_files.txt
tar  xvf SAMPLE.tar *_R2_*.fastq.gz --to-stdout > SAMPLE_R2.fastq.gz 2> SAMPLE_R2_files.txt
  • Basic read quality statistics
FASTX at Cold Spring Harbor Laboratory
  • Illumina CASAVA 1.8 changes
    • FASTQ quality scores now have an offset of 33. I.e. ASCII= base quality + 33
    • See details here
  • To extract pass-filter reads from CASAVA 1.8 files
cd /path/to/project/sample
   mkdir filtered
   for fastq in *.fastq.gz ; do 
       zcat $fastq |
       grep -A 3 '^@.* [^:]*:N:[^:]*:' |
       grep -v "^--$" | gzip > filtered/${fastq}.gz

Processing alignments

  • Samtools
  • Converting SAM to BAM (assuming SAM file has @ header):
samtools view -bS s_1_sequence.txt.sam > s_1_sequence.txt.bam
samtools sort s_1_sequence.txt.bam s_1_sequence.txt.sorted
samtools index s_1_sequence.txt.sorted.bam
  • Getting the read length from a SAM file
len=`zcat ${samfile}|grep chr1|head -1|cut -f10|wc -c`        
let len=($len-1)      
echo "len= ${len}"

NGS analyses

Gene fusion detection

  • TopHat-Fusion usage and tips:
    • Download here
    • A more detailed description of our lab pipeline can be found here
    • Annotation files:
sort -k3,3 -k5,5n refGene.txt > refGene_sorted.txt

Detection of insertions and deletions (indels)

  • Dindel is a program for calling small indels from next-gen sequencing data. It is currently handles only Illumina data and uses the aligned BAM files as input.
  • Download the program from here
  • Related: here

CHIP-seq analysis

Interval Comparison tool (available in ChIPseeqer package)

# compare -2kb+1kb promoters of RefSeq genes (with gene identifiers)
./CompareIntervals -intervals1 DATA/refGene_data_7June2009.txt.u2000_d1000 -hasid1 1 -intervals2 ChIPseeqer_targets.txt -showprofile 1
  • k-mer counting using GenomeTools from Stefan Kurtz and colleagues:

First, download the GenomeTools from http://genometools.org/

Then use the following commands to compile and run the program:

make cairo=no
make prefix=$HOME/usr cairo=no install 
gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg18/wg.fa -indexname reads  # we had to use -parts 10 on della for hg19
gt tallymer mkindex -mersize 30 -minocc 1 -indexname tyr-reads -counts -pl -esa reads -scan
gt tallymer search -output qseqnum qpos counts sequence -tyr tyr-reads -q hg18/wg.fa

The full manual for the tallymer software can be found at http://www.zbh.uni-hamburg.de/tallymer/tallymer.pdf

Sequence alignment for longer sequences

  • BLAT for read mapping
/Users/olivier/bin/blat -tileSize=6  -t=dna -q=dna ../refMrna.fa TEST/test_read.fa /dev/stdout
  • LAST




 -a is gap penalty
 -e score threshold
 match/mismatch penalties are +1/-1 by default

Need to add this to src/makefile in CC and G++ flags

-arch x86_64 

Make the index

src/lastdb -v hg18db hg18/WG/wg.fa

Run alignment

src/lastal  -a2 -e30 hg18db myreads.fa

Parallel alignment (split fasta file into 10 files, then align each one using LAST)

perl PBS_lastal.pl myreads.fa  10 myhg18db [0/1]
Personal tools