Elementolab/SNVseeqer Tutorial tools
From Icbwiki
- Multi-lane summarization (assumes a lanes.txt description file in current dir)
# summarize with coverage, min all freq, blosum62 filters, cancer genes $SNVSEEQERDIR/SCRIPTS/findMostFrequentSNVs.pl --dirs="." --canceronly=1 --suffix=".1000genomes.captured.kmut.flseq" \ --mincov=20 --minallfreq=0.25 --bl62min=1 --fastafile=REFDATA/hg18/wg.fa > SNVSummaryMissenseNonsense.txt # pretty excel table table2xls.pl SNVSummaryMissenseNonsense.txt --show=1 --useheaderrow=1 --header=1 # summarize non-coding (intron+promoters) mutations $SNVSEEQERDIR/SCRIPTS/findMostFrequentSNVs.pl --files="*.TFBS.Cons.flseq.capt" --canceronly=0 \ --mincov=30 --minallfreq=0.33 --noncoding=1 --contpat=Mert --maxnumcont=0 --has1KG=1 \ --fastafile=/Users/olivier/PROGRAMS/ChIPseeqer-1.0/DATA/hg18/WG/wg.fa
Options:
--canceronly=INT # (1 = Cancer Gene Census only, 0 = no filter) --mincov=INT # (min read coverage) --minallfreq=INT # minimum mutated allele frequency --noncoding=INT # (1 = only non-coding mutation) --sense=INT # (1 = show sense mutations only) --contpat=PAT # pattern to recognize control samples (using lanes.txt) --fastafile=FILE # genome file to detect context --maxnumcont=INT # filter out mutations found in that many samples
Same for indels:
$SNVSEEQERDIR/SCRIPTS/findMostFrequentINDELS.pl --dirs=. --canceronly=0 --minallfreq=0.25 --mincov=20 --fastafile=REFDATA/hg18/wg.fa > IndelSummary.txt table2xls.pl IndelSummary.txt --show=1 --useheaderrow=1 --header=1
- To create an RNA-seq Wiggle Track for the UCSC Genome Browser (GB)
If not yet done, split your SAM file (or Eland, etc) into one file per chromosome
$SNVSEEQERDIR/split_samfile s_1_12_sequence.txt.sam -outdir s_1_12_sequence.txt.sam_SPLIT
Then use the following script:
$SNVSEEQERDIR/SCRIPTS/makeWigTrack.pl --samfile=s_1_12_sequence.txt.sam --trackname="NB RNA-seq" --bigwig=1 --upload=1
The command above creates tracks in the BigWig format. What that means is that you'll need to upload the tracks to a web server, so that the GB can access them. Then, to load the track, you'll need to paste code like this into the GB Custom Track page:
track type=bigWig name="NB RNA-seq" description="NB RNA-seq" bigDataUrl=http://physiology.med.cornell.edu/faculty/elemento/lab/files/s_1_12_sequence.txt.sam.bw visibility="full" maxHeightPixels="64:64:11" smoothingWindow="10" viewLimits="0:100" autoScale="off"
The other thing is that you need to download a binary WigToBIgWig tool for your platform (mac or linux) and copy it into $SNVSEEQERDIR/SCRIPTS
Binaries are at http://hgdownload.cse.ucsc.edu/admin/exe/
The makeWigTrack program also normalizes the read counts by the total number of reads (RPKM-style), so that different wiggle tracks can be compared.
Additional options:
--chrdata=FILE # to specificy chromosome names and lengths --fraglen=INT # to extend to a certain length (fraglen) --gzip=INT # zip track
- To covert SAM read files to read coverage plots
$SNVSEEQERDIR/SCRIPTS/getReadCoverage.pl -read output_mRNA/split_reads_dir/ -outdir coverage_dir -uniq
Options:
-readsdirdb directory containing the split reads (e.g. reads.NM_000027)
-outdir output directory
-[no]uniq record only unique reads (default= 0)
- Summarize the gene expression (rpkm) for all coverage files in a directory
$SNVSEEQERDIR/SCRIPTS/sumDigitalExp.pl coverage_dir/ > output_readsum.txt Example output file and format: # nmid 3utr_reads 3utr_len 5utr_reads 5utr_len coding_reads coding_len total_reads len NM_000016 1524 907 236 430 8880 1266 10640 2603 NM_000046 9657 3183 258 1286 3765 1602 13680 6071 p.s. here 'reads' refers to the number of mapped bases.
- Compare gene expression between two RNA-seq datasets
R --slave --args output_readsum1.txt output_readsum2.txt output_file_name < $SNVSEEQERDIR/SCRIPTS/anlzDigExp.r
this takes two read summary (_readsum*.txt) files and generates two files:
1. output_file_name_desc.txt: contains differential gene expression for each gene in the two RNA-seq datasets;
also contains annotations from RefSeq
2. output_file_name_hist.pdf: histogram of read counts
Note:
1. You have to install the statistical analysis software [R].
2. To allow the script add refSeq annotations to the summary output, put the refSeq annotation file refSeqSummary.txt.gz [download here] in your REFDATA/ before running this analysis.
- Generating read coverage plots
R --slave --args coverage_dir s_1_12 s_2_12 s_3_12 s_4_12 plots_dir < $SNVSEEQERDIR/SCRIPTS/plotReadCoverage.r This generates read coverage plots (plots_dir/NM_*.jpg) for all RefSeq genes in your samples. The plots allow you to compare digital gene expression among several RNA-seq experiments. Note: The first argument refers to the root directory of your coverage files. The last argument refers to the output directory we will generate coverage plots to. The arguments in the middle are names of the subdirectories containing coverage data of your samples. The height at a position in a line is proportional to the number of reads at that position in the sample.
- Plotting Venn diagrams
You can enter two lists of genes and plot a Venn diagram at [http://physiology.med.cornell.edu/faculty/elemento/lab/snvdb/gene_venn.php]
- Inferring LOH from SNVseeqer output
findLOHregions.pl --snvfile=PB_tr0_uq1_dbSNP.txt --showp=1 --chr=chr18 > SNVchr18.txt embedSNVsInChrLongGrid.pl --snvlist=SNVchr18.txt --chr=chr18 --chrdata=REFDATA/hg18.chrdata > SNVchr18.txt.embed
- Basic Exon junction analysis
/home/ole2001/PROGRAMS/SNPseeqer/CalcExonJunctionRPKM -sbamfile s_1_sequence.txt.sorted.bam -exondata /home/ole2001/PROGRAMS/SNPseeqer/REFDATA/mm9/refGene.txt.25MAY2010.exondata -exoniclength 50 -verbose 1 -uniquereads 1 perl ~/PROGRAMS/SNPseeqer/SCRIPTS/CombineExonJunctionLaneData.pl --files="s_*.GRBAM" > JunctionDataUNQ.txt perl /home/ole2001/PROGRAMS/SNPseeqer/SCRIPTS/GrepGeneInJunctionList.pl JunctionDataUNQ.ordered.txt Cacna1c 1 perl /home/ole2001/PROGRAMS/SNPseeqer/SCRIPTS/FindSkippedExons.pl JunctionDataUNQ.ordered.txt | expression_annotate_mouse_ORFs.pl - > JunctionDataUNQ.ordered.SkippedExons.txt
