Elementolab/Cancer Exome Mutation Discovery pipeline

From Icbwiki

Jump to: navigation, search

This is the wiki page for the current Elemento lab cancer exome analysis pipeline.

All components (scripts, files, etc) in this pipeline are available from SNVseeqer and CNVseeqer

You can get these scripts from our SVN repository using:

svn co https://pbtech-vc.med.cornell.edu/public/svn/elementolab/SNPseeqer/trunk SNPseeqer/ 
svn co https://pbtech-vc.med.cornell.edu/public/svn/elementolab/CNVseeqer/trunk CNVseeqer/ 
svn co https://pbtech-vc.med.cornell.edu/public/svn/elementolab/BIO-C/trunk BIO-C/
svn co https://pbtech-vc.med.cornell.edu/public/svn/elementolab/PERL_MODULES/trunk PERL_MODULES/ 


Contents

Alignment

We use BWA with default parameters for alignment.

For single-end reads

perl ~/PROGRAMS/SNPseeqer/SCRIPTS/PBS_quickAlign.pl --fastq=s_1_sequence.txt.gz --bwaidx=/pbtech_mounts/oelab_scratch002/ole2001/DATA/GENOMES/hg19/hg19bwaidx --submit=1 

For paired-end

perl ~/PROGRAMS/SNPseeqer/SCRIPTS/PBS_quickAlignPE.pl --fastq1=s_1_1_sequence.txt.gz --fastq1=s_1_2_sequence.txt.gz \ 
 --bwaidx=/pbtech_mounts/oelab_scratch002/ole2001/DATA/GENOMES/hg19/hg19bwaidx --submit=1

(use --submit=0 to see the PBS script without submitting)


If you have multiple samples compared to the same reference, you can produce a table to summarize the results using the following command:

perl ~/PROGRAMS/SNPseeqer/SCRIPTS/TargetID_summarizeClones.pl --clonefile=desc_samples.txt --qspv=0.1 --nohmz=0 --zeropar=1 --fdr=0.05 --mincnt=1

where desc_samples.txt is a tab-delimited file containing sample ID and .cmp file, for example

1       ./s_1_sequence.txt.hg18.sorted.sam.SNVs.dbSNP.annot.1KG.kmut.TFBS.Cons.flseq.qs.cmp
2       ./s_2_sequence.txt.hg18.sorted.sam.SNVs.dbSNP.annot.1KG.kmut.TFBS.Cons.flseq.qs.cmp
3       ./s_3_sequence.txt.hg18.sorted.sam.SNVs.dbSNP.annot.1KG.kmut.TFBS.Cons.flseq.qs.cmp

SNV analysis

This analysis requires that SAM files have been created and split. See the SNVseeqer page for details.

perl ~/PROGRAMS/SNPseeqer/SCRIPTS/PBS_SNPseeqerPB_Genome --file=tumor_sequence.txt.sam --refgene=refGene.txt --genome=hg19 \
 --comparedir=control_sequence.txt.sam_SPLIT --cmpfdr=0.1 --submit=1 --pvtype="BI" --walltime="48:00:00"

Indel analysis

This is a more complicated analysis as it requires realigning BAM files:

perl ~/PROGRAMS/SNPseeqer/SCRIPTS/PBS_realignBAM.pl --bamfile=tumor_sequence.txt.sorted.bam --submit=1
perl ~/PROGRAMS/SNPseeqer/SCRIPTS/PBS_realignBAM.pl --bamfile=control_sequence.txt.sorted.bam  --submit=1

This process will create two new files:

tumor_sequence.txt.sorted.dedup.rg.resorted.realigned.bam 
control_sequence.txt.sorted.dedup.rg.resorted.realigned.bam

Next step is to call indels:

perl ~/PROGRAMS/SNPseeqer/SCRIPTS/PBS_runGATKindelDetector.pl --tumorbamfile=tumor_sequence.txt.sorted.dedup.rg.resorted.realigned.bam \ 
  --controlbamfile=control_sequence.txt.sorted.dedup.rg.resorted.realigned.bam --submit=1

When these scripts finish, one then needs to annotate the indels:

~/PROGRAMS/SNPseeqer/annotateGATKindels -indelfile CM_Breast_sequence.txt.sorted.dedup.rg.resorted.realigned.bam.cmp.indels.proc \ 
  -refgene $SNVSEEQERDIR/REFDATA/hg19/refGene.txt.17Jan2012  -chrdata $SNVSEEQERDIR/REFDATA/hg19.chrdata

Expected outcome:

chr10   52751088        52751088        +GCT    30      51      0       19      2.06795540680936e-06    5.61997292909367e-05    Transcript:NM_001098512,1,PRKG1,1,5UTR|Promoter:NM_001098512,PRKG1,1,178_bp_DN_TSS
chr14   86090003        86090004        -A      29      78      0       53      2.11283958289567e-08    8.87392624816183e-07    Transcript:NM_013231,2,FLRT2,1,3UTR
...

CNV analysis

Step 1 is to get a read number ratio for all exons on the exome cap platform:

~/PROGRAMS/CNVseeqer/CNVseeqer -chrdata $HOME/PROGRAMS/SNPseeqer/REFDATA/hg19.chrdata -bamfile1 tumor_sequence.txt.sorted.bam -bamfile2 control_sequence.txt.sorted.bam -captured $HOME/PROGRAMS/CNVseeqer/REFDATA/SureSelect_All_Exon_50Mb_hg19.txt -verbose 1 -outfile tumor_control_CNV.txt

The output should look like this. The ratio is tumor over control (bamfile1/bamfile2).

chr1    14467   14587   705     541     9.456e-02       1.593e-01       0.907
chr1    14639   14883   1090    978     8.976e-09       5.886e-08       0.776

Step 2 involves segmenting the output of CNVseeqer. This is done in R using the excellent DNAcopy package

To install DNAcopy (need to do this only once) in R

source("http://bioconductor.org/biocLite.R")
biocLite("DNAcopy")

If the hg19.chrdata contains partial chrom names with "_", it needs to be cleaned up

grep -v "_" tumor_control_CNV.txt > tumor_control_CNV_cleaned.txt

Then in R:

library(DNAcopy)
cn <- read.table("tumor_control_CNV_cleaned.txt", header=F)
cn <- cn[ (cn[,4]+cn[,5])>100, ]  # filters noise (low read counts)
CNA.object <-CNA( genomdat = log2(cn[,8]), chrom = cn[,1], maploc = cn[,2]+(cn[,3]-cn[,2])/2, data.type = 'logratio', sampleid="DLBCL patient")
CNA.smoothed <- smooth.CNA(CNA.object)
segs <- segment(CNA.smoothed, undo.splits = "sdundo",  undo.SD = 1.5, verbose = 1, min.width=5)
plot(segs, plot.type="s", ylim=c(-2,2))
write.table(segs$output, file="segments.txt", row.names=F, col.names=F, quote=F, sep="\t")

Step 3: analyze the genes in amplified/deleted segments

perl  analyzeDNAcopy.pl --segmentfile=segments.txt --refgene=$HOME/PROGRAMS/SNPseeqer/REFDATA/hg19/refGene.txt.17Jan2012 | sort_column.pl 5 | less

LOH analysis

This analysis needs .qs files produced by SNVseeqer.

Create a tab-delimited file called snvfile.txt

tumor_fastq.pf.hg19.unq.sam.SNVs.dbSNP.annot.1KG.kmut.TFBS.Cons.flseq.qs    tumor
control_fastq.pf.hg19.unq.sam.SNVs.dbSNP.annot.1KG.kmut.TFBS.Cons.flseq.qs      normal


Make a LOH map:

perl ~/PROGRAMS/SNPseeqer/SCRIPTS/visualizeMultiSampleLOH.pl --snvlist=snvfile.txt --mincov=20 --minallfreq=0.25 | perl ~/PROGRAMS/SNPseeqer/SCRIPTS/LOH_normal_tumor_cleanup.pl - > table.txt

Visualize the LOH map:

/opt/local/bin/perl5.16 ~/PROGRAMS/SNPseeqer/SCRIPTS/DrawLOHmapImproved.pl --snvdata=table.txt  --chrdata=/$HOME/PROGRAMS/SNPseeqer/REFDATA/hg19.chrdata

Options

--tumorlabel=R1 --normallabel=D  # change the tumor/normal labels


Loading mutations into the cBio portal

Create a directory with relevant files:

perl  ~/PROGRAMS/SNPseeqer/SCRIPTS/cBio_add_sample.pl --dir=pmpca_ctrl --cmpfile=s_1_sequence.txt.sam.SNVs.dbSNP.annot.1KG.kmut.TFBS.Cons.flseq.qs.cmp  \
 --fdr=0.05 --studyid=wcmc_pm_pca --patientid=WCMC-PM-1003 --studyname="WCMC PM PCa exome-seq" \
 --studydesc="PCa patient exome-seq, WCMC Institute for Precision Medicine" --histotype="Bone metastase" --cancertype=prad

Add --minfreq=0 --maxk0=0 if you used --usepv=1 instead of FDR calculation

Upload

scp -r -i ~/.ssh/yourawskey.pem pmpca/  ec2-user@ec2-50-19-32-49.compute-1.amazonaws.com:/home/ec2-user/portal-data/studies/


Login and load data:

ssh -i ~/.ssh/yourawskey.pem pmpca/ ec2-user@ec2-50-19-32-49.compute-1.amazonaws.com
cd $PORTAL_HOME/core/src/main/scripts
$PORTAL_HOME/core/src/main/scripts/importCancerStudy.pl $PORTAL_DATA_HOME/studies/pmpca/meta_study.txt
$PORTAL_HOME/core/src/main/scripts/importCaseList.pl $PORTAL_DATA_HOME/studies/pmpca/case_lists
$PORTAL_HOME/core/src/main/scripts/importClinicalData.pl $PORTAL_DATA_HOME/studies/pmpca/data_clinical.txt wcmc_pm_pca
$PORTAL_HOME/core/src/main/scripts/importProfileData.pl --data $PORTAL_DATA_HOME/studies/pmpca/data_mutations_extended.txt \
  --meta $PORTAL_DATA_HOME/studies/pmpca/meta_mutations_extended.txt --dbmsAction clobber
sudo /etc/init.d/tomcat6 stop
sudo /etc/init.d/tomcat6 start

To load CNAs: (data_CNA.txt being matrix of +1, 0, -1 etc for each gene, typically generated by GISTIC)

$PORTAL_HOME/core/src/main/scripts/importProfileData.pl --data $PORTAL_DATA_HOME/studies/baca/data_CNA.txt \
  --meta $PORTAL_DATA_HOME/studies/baca/meta_CNA.txt --dbmsAction clobber


The cBio portal does not support incremental updates; therefore if you want to add new patients to an existing study, you \ need to delete the database, and reload the updated one from scratch

mysql -u root -p
[enter without password]
drop database cgds_public
create database cgds_public
exit

Then:

initdb
importreferencedata
# import your dataset using the commands above
...
deployportal
Personal tools