Elementolab/ChIPseeqerAnnotate

From Icbwiki

Revision as of 22:26, 26 May 2011

Back to Elementolab/ChIPseeqer_Tutorial

ChIPseeqerAnnotate

In this analysis you can:

  • ask which peaks overlap with promoters, 3'UTR, exons and introns, and make sublists of these peaks.
  • ask which peaks are distal (>2kb and <50kb) or intergenic (>50kb), and find the closest 1 or 2 genes.
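
The distance classes in the second bullet can be sketched as a small function. This is only an illustration, not ChIPseeqerAnnotate's own code: the function name classify_by_distance, the label "proximal" for peaks within 2kb, and the treatment of the exact boundary values are assumptions. The two thresholds correspond to the --mindistaway and --maxdistal options described below.

```python
def classify_by_distance(distance_bp, mindistaway=2000, maxdistal=50000):
    """Label a peak by its distance (in bp) to the closest transcript.

    Hypothetical helper: the thresholds follow the 2kb / 50kb cut-offs
    given in the text; how the tool treats the exact boundaries is unknown.
    """
    if distance_bp < 0:
        raise ValueError("distance must be non-negative")
    if distance_bp > maxdistal:
        return "intergenic"   # further than 50kb from any transcript
    if distance_bp > mindistaway:
        return "distal"       # between 2kb and 50kb
    return "proximal"         # assumed label for everything closer than 2kb


for d in (500, 10000, 60000):
    print(d, classify_by_distance(d))
```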

To run the tools directly from any folder, add $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH variable. Read How to set the CHIPSEEQERDIR variable.

1. Type the command:

ChIPseeqerAnnotate --peakfile=TF_targets.txt [ --prefix=TF_targets_ANN --type=GeneParts ]

The following options are available:

--peakfile=FILE    file containing genomic regions
--prefix=STR       prefix for output files
--type=STR         can be GeneParts (default, for all genomes) or RNAGenes (only for hg18)
--genome=STR       can be hg18 (human),
                   mm9 (mouse),
                   dm3 (drosophila), or
                   sacser (for Saccharomyces cerevisiae)
--db=STR           can be refSeq (available for hg18, mm9, dm3),
                   AceView (for hg18, mm9),
                   Ensembl (for hg18, mm9, dm3), or
                   UCSCGenes (for hg18, mm9).
                   Default is refSeq.
--mindistaway=INT  minimum distance away from transcripts, used to define the distal regions. Default is 2000bp
--maxdistal=INT    maximum distance away from transcripts, used to define the distal regions. Default is 50000bp
--ext=INT          minimum distance away from peaks
--lenuP=INT        upstream length of the TSS. Default is 2000bp
--lendP=INT        downstream length of the TSS. Default is 2000bp
--lenuDW=INT       upstream length of the TES. Default is 2000bp
--lendDW=INT       downstream length of the TES. Default is 2000bp


IMPORTANT: The file given to the --peakfile option must be a peak file in ChIPseeqer output format.
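
If the annotation is launched from a script, the command line from step 1 can be assembled programmatically. The sketch below only builds the argument list; build_annotate_command and KNOWN_OPTIONS are hypothetical names introduced here, and the option names are the ones listed above. It assumes ChIPseeqerAnnotate is reachable through $PATH.

```python
import shlex

# Option names taken from the list above (--peakfile and --prefix are handled separately).
KNOWN_OPTIONS = {
    "type", "genome", "db", "mindistaway", "maxdistal",
    "ext", "lenuP", "lendP", "lenuDW", "lendDW",
}


def build_annotate_command(peakfile, prefix=None, **options):
    """Return the argument list for a ChIPseeqerAnnotate call (hypothetical helper)."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"undocumented option(s): {sorted(unknown)}")
    argv = ["ChIPseeqerAnnotate", f"--peakfile={peakfile}"]
    if prefix is not None:
        argv.append(f"--prefix={prefix}")
    for name, value in options.items():
        argv.append(f"--{name}={value}")
    return argv


argv = build_annotate_command(
    "TF_targets.txt", prefix="TF_targets_ANN", type="GeneParts", genome="hg18"
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) to execute the tool.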

2. See the results. The main output of this process consists of three files, with the extensions:

_ALL.GP, .GP and .GP.stats
  • The files that end with _ALL.* will look like this:
chrY	2867287	2867611	0
chrY	2871627	2871971	2	chrY-2871327-2871629	chrY-2871816-2872114
chrY	2944779	2944956	1	chrY-2944529-2945113
chrY	5642923	5643407	2	chrY-5639836-5643224	chrY-5643229-5643472
chrY	6905840	6906263	0
chrY	6917898	6918357	1	chrY-6918198-6918315
chrY	6945356	6945877	0
chrY	7267607	7267819	1	chrY-7267753-7267949
chrY	7381389	7381767	2	chrY-7381323-7381478	chrY-7381493-7381672
chrY	7652223	7652533	0
chrY	7659894	7660062	1	chrY-7659990-7660187

Each row represents a detected ChIPseeqer peak, whereas the columns indicate:

Chromosome	Start_Position	End_Position	Number_of_peaks_found	[peaks_found]
  • The file that ends with .GP will look like this:
chrY	2871627	2871971	2	chrY-2871327-2871629	chrY-2871816-2872114
chrY	2944779	2944956	1	chrY-2944529-2945113
chrY	5642923	5643407	2	chrY-5639836-5643224	chrY-5643229-5643472
chrY	6917898	6918357	1	chrY-6918198-6918315
chrY	7267607	7267819	1	chrY-7267753-7267949
chrY	7381389	7381767	2	chrY-7381323-7381478	chrY-7381493-7381672
chrY	7659894	7660062	1	chrY-7659990-7660187

This file is a filtered version of the previous one: in the example above, it keeps only the rows with at least one overlap.
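
The row layout described above can be read with a few lines of Python. This is a sketch under two assumptions: the columns are tab-separated, and the .GP file keeps exactly the rows whose count is greater than zero (which is what the sample rows suggest). GPRow and parse_gp_line are hypothetical names.

```python
from typing import List, NamedTuple


class GPRow(NamedTuple):
    chrom: str
    start: int
    end: int
    count: int            # Number_of_peaks_found
    overlaps: List[str]   # [peaks_found], may be empty


def parse_gp_line(line):
    """Split one tab-separated row of an _ALL.GP or .GP file."""
    fields = line.rstrip("\n").split("\t")
    return GPRow(fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4:])


# Three rows copied from the _ALL.GP example above.
sample = [
    "chrY\t2867287\t2867611\t0",
    "chrY\t2871627\t2871971\t2\tchrY-2871327-2871629\tchrY-2871816-2872114",
    "chrY\t2944779\t2944956\t1\tchrY-2944529-2945113",
]
rows = [parse_gp_line(line) for line in sample]

# Assumed filter: keep only peaks that overlap at least one region.
with_overlap = [row for row in rows if row.count > 0]
for row in with_overlap:
    print(row.chrom, row.start, row.end, row.overlaps)
```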

The GeneParts output files include information about the gene parts that overlap with the peaks, with I for Introns, E for Exons and P for Promoters. For example:

chrY	2867287	2867611	3	chrY-2863487-2881949:I-NM_001145276-1	chrY-2863810-2881949:I-NM_003411-1	chrY-2863810-2889114:I-NM_001145275-1
chrY	2871627	2871971	3	chrY-2863487-2881949:I-NM_001145276-1	chrY-2863810-2881949:I-NM_003411-1	chrY-2863810-2889114:I-NM_001145275-1
chrY	5642923	5643407	1	chrY-5509839-5665312:I-NM_032973-4
chrY	6917898	6918357	3	chrY-6840213-6923844:I-NM_134258-2	chrY-6906284-6923844:I-NM_033284-3	chrY-6906284-6949489:I-NM_134259-3
chrY	6945356	6945877	3	chrY-6906284-6949489:I-NM_134259-3	chrY-6923939-6949489:I-NM_033284-4	chrY-6923939-6949489:I-NM_134258-3
chrY	7267607	7267819	1	chrY-7254210-7269155:I-NM_002760-3
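
Each annotation in the rows above appears to follow the pattern chromosome-start-end:part-transcript-number. A minimal parser under that assumption is sketched below; parse_annotation is a hypothetical name, and the meaning of the trailing number (presumably the ordinal of the intron or exon within the transcript) is a guess, not something the text states.

```python
import re

PART_NAMES = {"I": "intron", "E": "exon", "P": "promoter"}

# Assumed layout: chrY-2863487-2881949:I-NM_001145276-1
ANNOTATION_RE = re.compile(
    r"(?P<chrom>[^-:]+)-(?P<start>\d+)-(?P<end>\d+)"
    r":(?P<part>[IEP])-(?P<transcript>.+)-(?P<number>\d+)"
)


def parse_annotation(text):
    """Split one gene-part annotation into its components."""
    match = ANNOTATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected annotation format: {text!r}")
    return {
        "chrom": match["chrom"],
        "start": int(match["start"]),
        "end": int(match["end"]),
        "part": PART_NAMES[match["part"]],
        "transcript": match["transcript"],
        "number": int(match["number"]),  # interpretation is an assumption
    }


print(parse_annotation("chrY-2863487-2881949:I-NM_001145276-1"))
```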
  • The files that end with .stats summarize statistical information. For the GeneParts option the .stats file will look like this (the values in parentheses are fractions of the total number of peaks, despite the % sign):
Number of peaks: 	 18814
Number of peaks that overlap with gene parts: 	 12364 	 (%0.657170192409907) 
Number of peaks that do not overlap with gene parts: 	 6450 	 (%0.342829807590092) 
Number of peaks that overlap with PROMOTERS only: 	 1115 	 (%0.0592643775911555) 
Number of peaks that overlap with EXONS only: 	 239 	 (%0.0127033060486871) 
Number of peaks that overlap with INTRONS only: 	 7763 	 (%0.412618262995642)
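
The bracketed values in the .stats example can be reproduced from the counts, which also shows that they are fractions (0.657 corresponds to 65.7%). The short calculation below uses the numbers from the example; the dictionary keys are shorthand introduced here. The three "only" rows do not add up to the number of overlapping peaks, and the remainder presumably covers peaks that fall into more than one category or into a category not listed, which is an inference rather than something the file states.

```python
total_peaks = 18814
counts = {
    "overlap": 12364,
    "no_overlap": 6450,
    "promoters_only": 1115,
    "exons_only": 239,
    "introns_only": 7763,
}

# Convert each count to a percentage of all peaks.
percentages = {label: 100.0 * n / total_peaks for label, n in counts.items()}
for label, pct in percentages.items():
    print(f"{label}: {pct:.1f}%")

# Overlapping peaks not accounted for by the three "only" rows.
single_category = (
    counts["promoters_only"] + counts["exons_only"] + counts["introns_only"]
)
remaining = counts["overlap"] - single_category
print("overlapping peaks outside the 'only' rows:", remaining)
```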

The GeneParts analysis also produces .promoters, .exons and .introns files that contain the peaks overlapping with each of these entities.

  • The .frac file also shows the percentage of peaks in each category, and the percentage of each category in the genome.
             fraction_of_peaks  fraction_of_genome
Promoters    18.6               4.32
Downstream   1.5                3.21
Exons        2.1                0.84
Introns      44.2               35.73
Distal       20.2               16.24
Intergenic   13.4               39.66
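
One way to read the .frac table is to divide the share of peaks by the share of the genome for each category: a ratio above 1 suggests that peaks are over-represented there. This ratio is not part of the ChIPseeqerAnnotate output; it is an illustrative calculation on the example values, and it is not a statistical test.

```python
# (fraction_of_peaks, fraction_of_genome), copied from the example above.
frac = {
    "Promoters": (18.6, 4.32),
    "Downstream": (1.5, 3.21),
    "Exons": (2.1, 0.84),
    "Introns": (44.2, 35.73),
    "Distal": (20.2, 16.24),
    "Intergenic": (13.4, 39.66),
}

# Ratio of peak share to genome share for each category.
enrichment = {name: peaks / genome for name, (peaks, genome) in frac.items()}

for name, ratio in sorted(enrichment.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<11}{ratio:.2f}")
```

In this example promoters show the highest ratio and intergenic regions the lowest.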


3. What can I do next?

You can run the mergeCSAnnotateGenesColumns program to extract specific columns from the .genes.annotated.txt file, and retrieve the genes with peaks in their promoters/exons/introns etc.

You can then perform pathway analysis on these gene lists. See make_PAGE_input.
