Elementolab/SNVseeqer Tutorial genome

From Icbwiki

Jump to: navigation, search

Elementolab/

Contents

Introduction

Related pages

Prep work

  • This tutorial assumes that short reads (GAIIx) were already mapped to the genome. Please visit our BWA tutorial page in order to learn how to use BWA to do so.
  • Split mapped read files into chromosome files.

Currently, we only support reads in Eland, Eland Extended and SAM format.

Assume that we have two files containing the mapped reads and that they are in "Eland Extended" format.

SCRIPTS/snpseq_split_extended_eland_files.pl s_3_eland_extended.txt  s_4_eland_extended.txt

If they are in Eland format:

SCRIPTS/snpseq_split_eland_files.pl s_3_eland.txt  s_4_eland.txt

If they are in SAM format:

./split_samfile s_1_sequence.txt.sam

The above commands will create reads.chr1, reads.chr2, etc (or read.NM_000767, etc if reads were mapped to RefSeq genes).

Running SNVseeqer:

 ./SNVseeqerPB -readdir . -format sam -readlen 42 -verbose 1 -minnumreads 4 -correction BH -fdr 0.01 -pvtype PB -uniquereads 1 -outfile SNVs.txt

The options are as follows:

-readlen INT     # specifies read length (bp)
-minnumreads INT # minimum coverage required (default is 4 reads)
-correction STR  # type of correction for multiple testing, BH (Benjamini-Hochberg) or BI (Bonferroni)
-fdr FLT         # FDR for BH correction
-pvtype STR      # type of p-value [PB= Poisson-binomial]
-uniquereads INT # if 1, only 1 read starting at each genomic position is retained. If 0, all reads are used (1 is prefered)
-verbose INT     # verbose output if 1
-format STR      # SAM or eland
-readdir DIR     # directory that contains the reads (one file per RefSeq transcript)
-outfile FILE    # output file (if not specified, output is sent to screen)

Note that we don't need to specify the -fastaref option when running against the whole genome. However, when running against the refSeq genes, it is necessary to provide SNVseeqerDB with the gene sequences (a fasta file, e.g. refGene.txt.25Nov2009.fa).


Determine which ones of the SNVs are in known SNPs in dbSNP.

This procedure requires that dbSNP be downloaded and split into chromosome files (see SNVseeqer installation).

./dbSNPmatches -snvfile SNVs.txt -snpprefix snp130.txt > SNVs_withdbSNP.txt

Determine which SNVs are in the 1000 Genomes project

$SNVSEEQERDIR/MatchGenomicSNVwith1000GenomesData -snvfile SNVs_withdbSNP.txt -dir1000 DATA/1000Genomes > SNVs_withdbSNP_w1000g.txt

The DATA/1000Genomes is supposed to contain 3 folders:

pilot1/
pilot2/
pilot3/

You can find the content of these folders

mkdir pilot1
cd pilot1
lftp ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_03/pilot1/
mget *

cd ..
mkdir pilot2
etc

Optional: filter out SNVs that do not look like SNPs in dbSNP.

./filterSNVsUsingdbSNPdata  -snvfile SNVs_withdbSNP.txt -entmut 1 -entstr 1 > SNVs_withdbSNP_filtered.txt

Annotate SNVs using RefSeq genes

This program tells you whether SNVs fall in coding regions, promoters, 3'UTRs, or 5'UTRs. If in coding region, the program also tells you whether the SNV is missense, nonsense or synonymous.

./annotateSNVs -snvfile SNVs_withdbSNP.txt -refgene REFDATA/refGene.txt.7June2009 > SNVs_withdbSNP_annotated.txt

Other options include:

-uptss INT          # define how many bp uptstream of the TSS promoters start (default is 2000bp). 
-dntss INT          # same but downstream of TSS (default is 2000bp). Altogether, promoters are defined as 4kb windows centered on the TSS. 

While refGene is the prefered annotation, there is also support for other annotations via the following arguments (annotation files are in REFDATA).

-wgrna FILE         # non coding RNAs
-knowngene FILE     # knownGene annotation from Genome Browser
-aceview FILE       # AceView annotation

Note however that for knowngene and aceview, nonsense, missense and synonymous annotation are unreliable (because the annotation does not provide enough information to reconstruct the coding frame).

Note also that annotateSNVs supports SNVs in SAMtools PileUp format, using the -cpileupfile option (instead of -snvfile).


The format of the annotation column is:

Transcript:[refSeq id], [exon id], [gene name], [strand], [CDS=coding sequence|5UTR|3UTR], [amino acid change; x number of reads with this mutation)], [missense|synonymous], [BLOSUM mutation score].

E.g. 
Transcript:NR_024576,13,SEC61A2,1,CDS,Q(CAA)>H(CAT)(4X),MISSENSE,BL62=0

To determine the conservation level (according to PhyloP) of SNVs:

./evaluateGenomicSNVConservation -consdir ./CONSERVATION -snvfile SNVs_withdbSNP_annotated.txt -chrdir REFDATA/ > SNVs_withdbSNP_annotated_withcons.txt

Note:
Use the --gz option if your PhyloP files in CONSERVATION/ are gzip compressed (e.g. chr1.phyloP44way.placental.wigFix.gz).

To determine whether SNVs disrupt or create transcription factor binding sites:

./evaluateGenomicSNVOverlapWithTFBS -snvfile SNVs_withdbSNP_annotated_withcons.txt \
 -fastafile REFDATA/wg.fa \
 -jasparmotifs REFDATA/jasparmotiflist.txt -bulykmotifs REFDATA/bulykmotiflist.txt > SNVs_withdbSNP_annotated_withcons_withTFBS.txt 

where -jasparmotifs and -bulykmotifs are files containing lists of file names, where each file is a motif (jn Jaspar and Bulyk PBM format, resp). These motif files are distributed as part of SNVseeqer, but the lists of file names can be edited (for example, to keep only TFBS of interest).

In the output file (last column), +> indicates TFBS creation and -> indicates disruption.

To extract SNV flanking sequences

(wg.fa must be indexed using samtools faidx)

$SNVSEEQERDIR/GetSNPSubstringIdx -snvfile s_7_sequence.txt.sam.SNVs.dbSNP.annotated.1000genomes.captured \
  -fastafile REFDATA/hg18/wg.fa -chrdata REFDATA/hg18.chrdata -output tab -w 30 

Note: this tool requires that wg.fa be indexed first, using:

$SNVSEEQERDIR/IndexFasta  REFDATA/hg18/wg.fa 

This will create a REFDATA/hg18/wg.fa.fai file. The command needs to be ran only once to create the index.


To determine which of the MIS/NONSENSE mutations are in Uniprot or Cosmic

./FindKnownAAMutations.pl --snvfile=s_5_sequence.txt.sam.SNVs.dbSNP.annotated.1000genomes.captured --anntype=genome 

Option: --showmutonly=1

Visualize specific SNVs:

./FindReadsThatOverlapWithSNP -readfile reads.chr7 -pos 75452954 -chr chr7 -format sam -offset 40

Options:
 -readfile   read alignment file
 -format     sam or eland
 -offset     number of flanking nucleotides shown; offset should be at least (read length - 1).
 -pos        genomic position on the chromosome
0	F	  75452927	       29C	            ............................T.
1	F	  75452928	         -	             ..............................
2	F	  75452929	       27C	              ..........................T...
3	F	  75452932	   19A/24C	                 ..................C....T......
4	R	  75452932	         -	                 ..............................
5	F	  75452936	   20C/28C	                     ...................T.......T..
6	F	  75452937	       19C	                      ..................T...........
7	R	  75452937	         -	                      ..............................
8	F	  75452941	       15C	                          ..............T...............
9	F	  75452941	       15C	                          ..............T...............
10	F	  75452941	         -	                          ............................G.
11	F	  75452942	       14C	                           .............T................
12	R	  75452943	       13C	                            ............T.................
13	R	  75452944	         -	                             ..............................
14	F	  75452944	       12C	                             ...........T..................
15	R	  75452945	         -	                              .............................T
16	F	  75452951	         -	                                    .......................C....T.
17	F	  75452952	        4C	                                     ...T..........................
18	R	  75452955	         -	                                        ..............................
		          	          	CCAACTGGCTGCGGGCCAAGGAGCCTGCCGGGGAGAACGGCGGCCGTGCGCTGGTGCCCATGTTCGTGCGCAAGTCCCAG (REFERENCE)

Create a Genome Browser track from SNV data:

./SNVseeqer2track -snvfile SNVs.txt -trackname "Jurkat-RNA-seq-SNVs"
Personal tools