Elementolab/SNVseeqer Tutorial tools

From Icbwiki

Jump to: navigation, search
  • Multi-lane summarization (assumes a lanes.txt description file in current dir)
# summarize with coverage, min all freq, blosum62 filters, cancer genes
$SNVSEEQERDIR/SCRIPTS/findMostFrequentSNVs.pl --dirs="." --canceronly=1 --suffix=".1000genomes.captured.kmut.flseq" \ 
   --mincov=20 --minallfreq=0.25 --bl62min=1 --fastafile=REFDATA/hg18/wg.fa > SNVSummaryMissenseNonsense.txt

# pretty excel table
table2xls.pl SNVSummaryMissenseNonsense.txt --show=1 --useheaderrow=1 --header=1  

# summarize non-coding (intron+promoters) mutations
$SNVSEEQERDIR/SCRIPTS/findMostFrequentSNVs.pl --files="*.TFBS.Cons.flseq.capt" --canceronly=0 \
  --mincov=30 --minallfreq=0.33  --noncoding=1 --contpat=Mert --maxnumcont=0 --has1KG=1 \
  --fastafile=/Users/olivier/PROGRAMS/ChIPseeqer-1.0/DATA/hg18/WG/wg.fa 

Options:

 --canceronly=INT # (1 = Cancer Gene Census only, 0 = no filter)
 --mincov=INT     # (min read coverage)
 --minallfreq=INT # minimum mutated allele frequency
 --noncoding=INT  # (1 = only non-coding mutation)
 --sense=INT      # (1 = show sense mutations only) 
 --contpat=PAT    # pattern to recognize control samples (using lanes.txt)
 --fastafile=FILE # genome file to detect context
 --maxnumcont=INT # filter out mutations found in that many samples
 

Same for indels:

$SNVSEEQERDIR/SCRIPTS/findMostFrequentINDELS.pl --dirs=. --canceronly=0 --minallfreq=0.25 --mincov=20 --fastafile=REFDATA/hg18/wg.fa > IndelSummary.txt
table2xls.pl IndelSummary.txt --show=1 --useheaderrow=1 --header=1
  • To create an RNA-seq Wiggle Track for the UCSC Genome Browser (GB)

If not yet done, split your SAM file (or Eland, etc) into one file per chromosome

$SNVSEEQERDIR/split_samfile s_1_12_sequence.txt.sam -outdir s_1_12_sequence.txt.sam_SPLIT

Then use the following script:

$SNVSEEQERDIR/SCRIPTS/makeWigTrack.pl --samfile=s_1_12_sequence.txt.sam --trackname="NB RNA-seq" --bigwig=1 --upload=1

The command above creates tracks in the BigWig format. What that means is that you'll need to upload the tracks to a web server, so that the GB can access them. Then, to load the track, you'll need to paste code like this into the GB Custom Track page:

track type=bigWig name="NB RNA-seq" description="NB RNA-seq" 
 bigDataUrl=http://physiology.med.cornell.edu/faculty/elemento/lab/files/s_1_12_sequence.txt.sam.bw visibility="full" maxHeightPixels="64:64:11" 
 smoothingWindow="10" viewLimits="0:100" autoScale="off"

The other thing is that you need to download a binary WigToBIgWig tool for your platform (mac or linux) and copy it into $SNVSEEQERDIR/SCRIPTS

Binaries are at http://hgdownload.cse.ucsc.edu/admin/exe/

The makeWigTrack program also normalizes the read counts by the total number of reads (RPKM-style), so that different wiggle tracks can be compared.

Additional options:

--chrdata=FILE   # to specificy chromosome names and lengths
--fraglen=INT    # to extend to a certain length (fraglen)
--gzip=INT       # zip track


  • To covert SAM read files to read coverage plots
$SNVSEEQERDIR/SCRIPTS/getReadCoverage.pl -read output_mRNA/split_reads_dir/ -outdir coverage_dir -uniq
Options:
       -readsdirdb  directory containing the split reads (e.g. reads.NM_000027)
       -outdir      output directory
       -[no]uniq    record only unique reads (default= 0)
  • Summarize the gene expression (rpkm) for all coverage files in a directory
$SNVSEEQERDIR/SCRIPTS/sumDigitalExp.pl coverage_dir/ > output_readsum.txt
 
Example output file and format:
# nmid     3utr_reads  3utr_len 5utr_reads 5utr_len coding_reads coding_len total_reads    len
NM_000016        1524       907        236      430         8880       1266       10640   2603
NM_000046        9657      3183        258     1286         3765       1602       13680   6071
 
p.s. here 'reads' refers to the number of mapped bases.
  • Compare gene expression between two RNA-seq datasets
R --slave --args output_readsum1.txt output_readsum2.txt output_file_name < $SNVSEEQERDIR/SCRIPTS/anlzDigExp.r

this takes two read summary (_readsum*.txt) files and generates two files:
1. output_file_name_desc.txt: contains differential gene expression for each gene in the two RNA-seq datasets; 
                              also contains annotations from RefSeq
2. output_file_name_hist.pdf: histogram of read counts
 
Note: 
1. You have to install the statistical analysis software [R].
2. To allow the script add refSeq annotations to the summary output, put the refSeq annotation file refSeqSummary.txt.gz [download here] in your REFDATA/ before running this analysis.
  • Generating read coverage plots
R --slave --args coverage_dir s_1_12  s_2_12  s_3_12  s_4_12 plots_dir < $SNVSEEQERDIR/SCRIPTS/plotReadCoverage.r 
 
This generates read coverage plots (plots_dir/NM_*.jpg) for all RefSeq genes in your samples. The plots allow you to compare digital gene expression among several RNA-seq experiments.
  
Note: 
 The first argument refers to the root directory of your coverage files.
 The last argument refers to the output directory we will generate coverage plots to. 
 The arguments in the middle are names of the subdirectories containing coverage data of your samples.
 The height at a position in a line is proportional to the number of reads at that position in the sample.
middle
  • Plotting Venn diagrams
You can enter two lists of genes and plot a Venn diagram at [http://physiology.med.cornell.edu/faculty/elemento/lab/snvdb/gene_venn.php]
middle
  • Inferring LOH from SNVseeqer output
findLOHregions.pl --snvfile=PB_tr0_uq1_dbSNP.txt --showp=1 --chr=chr18 > SNVchr18.txt
embedSNVsInChrLongGrid.pl --snvlist=SNVchr18.txt --chr=chr18 --chrdata=REFDATA/hg18.chrdata  > SNVchr18.txt.embed


  • Basic Exon junction analysis
/home/ole2001/PROGRAMS/SNPseeqer/CalcExonJunctionRPKM -sbamfile s_1_sequence.txt.sorted.bam -exondata /home/ole2001/PROGRAMS/SNPseeqer/REFDATA/mm9/refGene.txt.25MAY2010.exondata -exoniclength 50 -verbose 1 -uniquereads 1
perl ~/PROGRAMS/SNPseeqer/SCRIPTS/CombineExonJunctionLaneData.pl --files="s_*.GRBAM" > JunctionDataUNQ.txt 
perl /home/ole2001/PROGRAMS/SNPseeqer/SCRIPTS/GrepGeneInJunctionList.pl  JunctionDataUNQ.ordered.txt Cacna1c 1
perl /home/ole2001/PROGRAMS/SNPseeqer/SCRIPTS/FindSkippedExons.pl  JunctionDataUNQ.ordered.txt  | expression_annotate_mouse_ORFs.pl - > JunctionDataUNQ.ordered.SkippedExons.txt
Personal tools