Elementolab/ChIPseeqerAnnotate

ChIPseeqerAnnotate

In this analysis you can:

  • ask which peaks overlap with promoters, 3'UTRs, exons and introns, and make sublists of these peaks.
  • ask which peaks are distal (>2kb and <50kb from a transcript) or intergenic (>50kb), and find the closest 1 or 2 genes.

To run the tools directly from any folder, add $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH variable. See How to set the CHIPSEEQERDIR variable.

1. Type the command:

ChIPseeqerAnnotate --peakfile=TF_targets.txt [ --prefix=TF_targets_ANN --type=GeneParts ]

The following options are available:

--peakfile=FILE File with ChIP-seq peaks.
--lenuP=INT     Define the length upstream of TSS. Default is 2000bp.
--lendP=INT     Define the length downstream of TSS. Default is 2000bp.
--lenuDW=INT    Define the length upstream of TES. Default is 2000bp.
--lendDW=INT    Define the length downstream of TES. Default is 2000bp.
--genome=STR    hg19 (human)
                hg18 (human)
                mm10 (mouse)
                mm9 (mouse)
                rn4 (rat)
                dm3 (drosophila)
                sacser (Saccharomyces cerevisiae)
--db=STR        refSeq (available for hg19, hg18, mm10, mm9, rn4, dm3)
                AceView (for hg19, hg18, mm9)
                Ensembl (for hg19, hg18, mm10, mm9, rn4, dm3)
                UCSCGenes (for hg19, hg18, mm10, mm9).
                Default is refSeq.
--mindistaway=INT  Define minimum distance away from transcripts, used to define the distal regions. Default is 2000bp.
--maxdistal=INT Define maximum distance away from transcripts, used to define the distal regions. Default is 50000bp.
--verbose=INT   Verbose mode. Default is 0.
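
The --mindistaway and --maxdistal options together split peaks that lie outside transcripts into distal and intergenic ones. The sketch below illustrates that split with the default values; the function name, the "near_transcript" label and the handling of the exact boundary values are assumptions made for illustration, not ChIPseeqer code.

```python
def classify_by_distance(distance_bp, mindistaway=2000, maxdistal=50000):
    """Label a peak by its distance (in bp) to the closest transcript.

    How the exact boundaries (2000 and 50000) are treated is an
    assumption of this sketch.
    """
    if distance_bp <= mindistaway:
        return "near_transcript"   # hypothetical label, closer than --mindistaway
    if distance_bp < maxdistal:
        return "distal"            # >2kb and <50kb with the defaults
    return "intergenic"            # beyond --maxdistal


for d in (500, 10000, 120000):
    print(d, classify_by_distance(d))
```

Raising --maxdistal moves peaks from the intergenic group into the distal group, and raising --mindistaway shrinks the distal group from the other side.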

IMPORTANT: The file passed to the --peakfile option must be in ChIPseeqer output format.
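
If you want to script this step, the command line can be assembled from the options listed above. This is a minimal sketch: build_annotate_command is a hypothetical helper, it only builds the argument list, and it does not check that the option names or values are valid.

```python
import shlex


def build_annotate_command(peakfile, **options):
    """Return the ChIPseeqerAnnotate argument list for the given options."""
    cmd = ["ChIPseeqerAnnotate", f"--peakfile={peakfile}"]
    for name, value in options.items():
        # Each option is written as --name=value, as in the command above.
        cmd.append(f"--{name}={value}")
    return cmd


cmd = build_annotate_command(
    "TF_targets.txt",
    prefix="TF_targets_ANN",
    type="GeneParts",
    genome="hg19",
    db="refSeq",
)
print(shlex.join(cmd))
# To run it, pass the list to subprocess.run(cmd, check=True);
# this assumes ChIPseeqerAnnotate is on your $PATH.
```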

2. See the results. The main outputs of this process are three files with the extensions:

_ALL.GP, .GP and .GP.stats
  • The files that end with _ALL.* will look like this:
chrY	2867287	2867611	0
chrY	2871627	2871971	2	chrY-2871327-2871629	chrY-2871816-2872114
chrY	2944779	2944956	1	chrY-2944529-2945113
chrY	5642923	5643407	2	chrY-5639836-5643224	chrY-5643229-5643472
chrY	6905840	6906263	0
chrY	6917898	6918357	1	chrY-6918198-6918315
chrY	6945356	6945877	0
chrY	7267607	7267819	1	chrY-7267753-7267949
chrY	7381389	7381767	2	chrY-7381323-7381478	chrY-7381493-7381672
chrY	7652223	7652533	0
chrY	7659894	7660062	1	chrY-7659990-7660187

Each row represents a detected ChIPseeqer peak, and the columns indicate:

Chromosome	Start_Position	End_Position	Number_of_peaks_found	[peaks_found]
  • The file that ends with .GP will look like this:
chrY	2871627	2871971	2	chrY-2871327-2871629	chrY-2871816-2872114
chrY	2944779	2944956	1	chrY-2944529-2945113
chrY	5642923	5643407	2	chrY-5639836-5643224	chrY-5643229-5643472
chrY	6917898	6918357	1	chrY-6918198-6918315
chrY	7267607	7267819	1	chrY-7267753-7267949
chrY	7381389	7381767	2	chrY-7381323-7381478	chrY-7381493-7381672
chrY	7659894	7660062	1	chrY-7659990-7660187

This file is a filtered version of the previous one: it keeps only the peaks with at least one overlap (a non-zero fourth column).
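
Comparing the two example listings suggests that the .GP file contains the rows of the _ALL.GP file whose fourth column is greater than zero. The sketch below parses rows in this layout and applies that filter; the function names are hypothetical, and the filtering rule is inferred from the examples rather than taken from the ChIPseeqer source.

```python
def parse_gp_line(line):
    """Split one tab-separated row into (chrom, start, end, count, overlaps)."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start, end, count = int(fields[1]), int(fields[2]), int(fields[3])
    return chrom, start, end, count, fields[4:]


def keep_overlapping(lines):
    """Keep only the rows that report at least one overlap."""
    return [ln for ln in lines if ln.strip() and parse_gp_line(ln)[3] > 0]


# First three rows of the _ALL.GP example above.
all_gp = [
    "chrY\t2867287\t2867611\t0",
    "chrY\t2871627\t2871971\t2\tchrY-2871327-2871629\tchrY-2871816-2872114",
    "chrY\t2944779\t2944956\t1\tchrY-2944529-2945113",
]
filtered = keep_overlapping(all_gp)
for row in filtered:
    print(row)
```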

The GeneParts output files include information about the gene parts that overlap with the peaks, with I for introns, E for exons and P for promoters. For example:

chrY	2867287	2867611	3	chrY-2863487-2881949:I-NM_001145276-1	chrY-2863810-2881949:I-NM_003411-1	chrY-2863810-2889114:I-NM_001145275-1
chrY	2871627	2871971	3	chrY-2863487-2881949:I-NM_001145276-1	chrY-2863810-2881949:I-NM_003411-1	chrY-2863810-2889114:I-NM_001145275-1
chrY	5642923	5643407	1	chrY-5509839-5665312:I-NM_032973-4
chrY	6917898	6918357	3	chrY-6840213-6923844:I-NM_134258-2	chrY-6906284-6923844:I-NM_033284-3	chrY-6906284-6949489:I-NM_134259-3
chrY	6945356	6945877	3	chrY-6906284-6949489:I-NM_134259-3	chrY-6923939-6949489:I-NM_033284-4	chrY-6923939-6949489:I-NM_134258-3
chrY	7267607	7267819	1	chrY-7254210-7269155:I-NM_002760-3
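
Each overlap entry in the rows above can be split into its coordinates, the gene part code and the transcript identifier. The sketch below does this for the intron entries shown; parse_annotation is a hypothetical helper, the meaning of the trailing number (assumed here to be the index of the part within the transcript) is a guess, and exon or promoter entries are assumed to follow the same layout.

```python
PART_NAMES = {"I": "intron", "E": "exon", "P": "promoter"}


def parse_annotation(token):
    """Parse an entry such as 'chrY-5509839-5665312:I-NM_032973-4'."""
    region, detail = token.split(":", 1)
    # Split from the right so the chromosome name is kept whole.
    chrom, start, end = region.rsplit("-", 2)
    code, rest = detail.split("-", 1)
    transcript, index = rest.rsplit("-", 1)
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "part": PART_NAMES.get(code, code),
        "transcript": transcript,
        "index": int(index),  # assumed: position of the part in the transcript
    }


ann = parse_annotation("chrY-5509839-5665312:I-NM_032973-4")
print(ann["part"], ann["transcript"], ann["index"])
```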
  • The files that end with .stats summarize statistical information. For the GeneParts option the .stats file will look like this:
Number of peaks: 	 18814
Number of peaks that overlap with gene parts: 	 12364 	 (%0.657170192409907) 
Number of peaks that do not overlap with gene parts: 	 6450 	 (%0.342829807590092) 
Number of peaks that overlap with PROMOTERS only: 	 1115 	 (%0.0592643775911555) 
Number of peaks that overlap with EXONS only: 	 239 	 (%0.0127033060486871) 
Number of peaks that overlap with INTRONS only: 	 7763 	 (%0.412618262995642)
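
Note that the values printed after the % sign in this example are fractions between 0 and 1, not percentages: 12364 / 18814 is about 0.657, i.e. 65.7%. The sketch below reads lines in this layout and checks that relationship; parse_stats_line is a hypothetical helper and assumes the fields are separated by tabs.

```python
def parse_stats_line(line):
    """Return (label, count, fraction) for one .stats line."""
    parts = [p.strip() for p in line.split("\t")]
    label = parts[0].rstrip(":").strip()
    count = int(parts[1])
    fraction = None
    if len(parts) > 2 and parts[2]:
        # The third field looks like "(%0.657170192409907)".
        fraction = float(parts[2].strip("()%"))
    return label, count, fraction


lines = [
    "Number of peaks: \t 18814",
    "Number of peaks that overlap with gene parts: \t 12364 \t (%0.657170192409907) ",
]
_, total, _ = parse_stats_line(lines[0])
label, count, fraction = parse_stats_line(lines[1])
print(f"{label}: {count} of {total} = {100 * count / total:.1f}%")
```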

The GeneParts analysis also produces .promoters, .exons and .introns files that contain the peaks overlapping with each of these entities.

  • The .frac file shows the percentage of peaks in each category, alongside the percentage of the genome that each category covers.
           fraction_of_peaks fraction_of_genome
Promoters	  18.6	        4.32
Downstream	  1.5	        3.21
Exons	          2.1	        0.84
Introns	          44.2	        35.73
Distal	          20.2	        16.24
Intergenic	  13.4	        39.66
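
One simple way to read this table is to divide the peak percentage by the genome percentage for each category: a ratio above 1 means the category holds more peaks than its share of the genome would suggest. The sketch below computes that ratio from the example values; the ratio is not something ChIPseeqer reports, and parse_frac is a hypothetical helper that assumes whitespace-separated columns.

```python
frac_text = """\
           fraction_of_peaks fraction_of_genome
Promoters   18.6  4.32
Downstream  1.5   3.21
Exons       2.1   0.84
Introns     44.2  35.73
Distal      20.2  16.24
Intergenic  13.4  39.66
"""


def parse_frac(text):
    """Map each category to (percent_of_peaks, percent_of_genome)."""
    rows = {}
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) == 3:
            rows[fields[0]] = (float(fields[1]), float(fields[2]))
    return rows


rows = parse_frac(frac_text)
ratios = {name: peaks / genome for name, (peaks, genome) in rows.items()}
for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:<12}{ratio:.2f}")
```

In this example promoters come out highest (about 4.3) and intergenic regions lowest (about 0.34).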


3. What can I do next?

You can run the mergeCSAnnotateGenesColumns program to extract specific columns from the .genes.annotated.txt file and retrieve the genes with peaks in their promoters, exons, introns, etc.

On these gene lists you can then perform pathway analysis. See make_PAGE_input.
