Elementolab/SNVseeqer Tutorial mRNA

From Icbwiki

Jump to: navigation, search

Elementolab/

Contents

Introduction

This tutorial assumes that short reads (GAIIx) were mapped to RefSeq transcripts. Please visit our BWA tutorial page in order to learn how to use BWA to do so.

Related pages

Prep work

Split mapped read files into one file per mRNA

Read files are usually in SAM format (but Eland or Eland Extended formats are also supported using scripts in the SCRIPTS/ directory).

./split_samfile s_1_sequence.txt.sam -outdir .

The above commands will create files such as read.NM_000767 (one per mRNA).

Run SNVseeqer

  • To identify single nucleotide mutations, run SNVseeqer on the SAM alignment files
 ./SNVseeqerPB -readdir . -fastaref REFDATA/refGene.txt.25Nov2009.fa -format sam \ 
   -readlen 42 -verbose 1 -minnumreads 4 -correction BH -fdr 0.01 -pvtype PB -uniquereads 1 -outfile SNVs.txt

The options are as follows:

-readlen INT     # specifies read length (bp)
-minnumreads INT # minimum coverage required (default is 4 reads)
-correction STR  # type of correction for multiple testing, BH (Benjamini-Hochberg) or BI (Bonferroni)
-fdr FLT         # FDR for BH correction
-pvtype STR      # type of p-value 
-uniquereads INT # if 1, only 1 read starting at each genomic position is retained. If 0, all reads are used (1 is prefered)
-verbose INT     # verbose output
-format STR      # SAM or eland
-fastaref FILE   # reference sequences. Only required for reads mapped to RefSeq transcripts. The REFDATA/refGene.txt.25Nov2009.fa file is distributed with SNVseeqer.
-readdir DIR     # directory that contains the reads (one file per RefSeq transcript)
-outfile FILE    # output file (if not specified, output is sent to screen)

Please see the output format here.

Downstream analyses

Annotate mutations with polymorphism information

This step determines which of the SNVs detected by SNVseeqer are known polymorphisms based on the dbSNP database.

./dbSNPmatchesRefSeq -snvfile SNVs.txt -refseqsnpfile REFDATA/refGene.txt.25Nov2009.dbSNPmap > SNVs_withdbSNP.txt

Optional: filter out SNVs that do not have the features of known SNPs

This step filters out mutations with strand bias and/or mutation bias.

./filterSNVsUsingdbSNPdata -snvfile SNVs_withdbSNP.txt -quantile 0.05 -entstr 1 -entmut 1 > SNVs_withdbSNP_filtered.txt

Annotate SNVs in RefSeq genes as synonymous, missense, etc

This step provide important information regarding the biological impact of a mutation. The location (e.g. promoter, 3UTR, 5UTR), type (e.g. synonymous, missense), and impact (BLOSUM62 score) will be added to each mutation. For more detailed analysis of the effect of a mutation, the user can use PolyPhen.

./annotateSNVsRefSeq -snvfile SNVs_withdbSNP.txt -refgene REFDATA/refGene.txt.25Nov2009 \ 
  -fastaref REFDATA/refGene.txt.25Nov2009.fa > SNVs_withdbSNP_annotated.txt

Note: output amino acid residue number is 1-based.

Evaluate the evolutionary importance (conservation) across other species

This step evaluates the conservation of the mutated position at the amino acid level across multiple species using protein sequences from all ENSEMBL species.

./evaluateRefSeqAAConservation -snvfile SNVs_withdbSNP_annotated.txt -reflink REFDATA/refLinkrefGene.txt.26Dec2009 \
  -alndir ALN/ -refgene REFDATA/refGene.txt.25Nov2009 > SNVs_withdbSNP_annotated_withAAcons.txt

where:

-reflink FILE  # indicates NM to NP correspondence file (the file above is provided with SNVseeqer in REFDATA)
-alndir DIR    # contains alignment files (see SNVseeqer installation section)

Evaluate conservation of 3'UTR SNVs

For non-coding mutations in the 3' untranslated region (3'UTR), we evaluate the impact of the mutation using conservation scores calculated by PhyloP.

./evaluateRefSeq3UTRConservation -consdir REFSEQ3UTRCONS/ -snvfile SNVs_withdbSNP_annotated_withAAcons.txt \
 -refgene REFDATA/refGene.txt.25Nov2009 > SNVs_withdbSNP_annotated_withAAcons_with3UTRcons.txt


Determine whether SNVs disrupt/create miRNA seeds

./evaluateRefSeqOverlapWithMiRNAseeds -snvfile SNVs_withdbSNP_annotated_withAAcons_with3UTRcons.txt \
 -fastaref REFDATA/refGene.txt.25Nov2009.utr3seq -refgene REFDATA/refGene.txt.25Nov2009 -seedfile REFDATA/mirnas-29DEC2009.seeds \ 
  > SNVs_withdbSNP_annotated_withAAcons_with3UTRcons_withmiRNAs.txt

Add known mutation information (from Cosmic and Uniprot)

./FindKnownAAMutations.pl --snvfile=s_2_sequence.txt.sam.SNVs.dbSNP.annotated --anntype=refseq 

Options: --showmutonly=1 to hide non-mutated SNVs


Visualization

We can visualize the stack of reads covering a given RefSeq position using:

./FindReadsThatOverlapWithSNP -pos 923 -format sam -readfile TEST/REFSEQ/P3/reads.NM_002648 -fastafile TEST/REFSEQ/NM_002648.fa -uniquereads 1 -offset 40

Example output:

0	F	       885	         -	 C.........................................
1	F	       886	         -	  ......................................C..C
2	F	       893	         -	         ..........................................
3	F	       896	         -	            ..........................................
4	F	       903	         -	                   ..........................................
5	R	       904	         -	                    .....C..............C.....................
6	F	       911	         -	                           ..........................................
7	F	       912	         -	                            .........................................T
8	R	       914	         -	                              .....C.C..C...............................
9	F	       914	         -	                              .........................................T
10	R	       915	         -	                               ......C...................................
11	F	       915	         -	                               ..........................................
12	R	       918	         -	                                  C..C..C...................................
		          	          	CTGGAGGCCGTGCGGCACTGCCACAACTGCGGGGTGCTCCACCGCGACATCAAGGACGAAAACATCCTTATCGACCTCAA (REFERENCE)
               
		          	          	                                        ^

where:

-offset INT     # 1= mask identical bases with .
                # 0= change identical bases to lowercase

Misc.

Useful scripts

Personal tools