Elementolab/ChIPseeqerAnnotate
ChIPseeqerAnnotate
In this analysis you can:
- ask which peaks overlap with promoters, 3'UTR, exons and introns, and make sublists of these peaks.
- ask which peaks are distal (>2kb and <50kb) or intergenic (>50kb), and find the closest one or two genes.
To run the tools directly from any folder, you need to add $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH variable. Read How to set the CHIPSEEQERDIR variable.
1. Type the command:
ChIPseeqerAnnotate --peakfile=TF_targets.txt [ --prefix=TF_targets_ANN --type=GeneParts ]
The following options are available:
--peakfile=FILE file containing genomic regions
--prefix=STR prefix for output files
--type=STR can be GeneParts (default, for all genomes) or RNAGenes (only for hg18)
--genome=STR can be hg18 (human),
mm9 (mouse),
dm3 (drosophila), or
sacser (for Saccharomyces cerevisiae)
--db=STR can be refSeq (available for hg18, mm9, dm3),
AceView (for hg18, mm9),
Ensembl (for hg18, mm9, dm3), or
UCSCGenes (for hg18, mm9).
Default is refSeq.
--mindistaway=INT minimum distance away from transcripts, used to define the distal regions. Default is 2000bp
--maxdistal=INT maximum distance away from transcripts, used to define the distal regions. Default is 50000bp
--ext=INT minimum distance away from peaks
--lenuP=INT upstream length of the TSS. Default is 2000bp
--lendP=INT downstream length of the TSS. Default is 2000bp
--lenuDW=INT upstream length of the TES. Default is 2000bp
--lendDW=INT downstream length of the TES. Default is 2000bp
IMPORTANT: Note that in the --peakfile option you must enter a peak file in ChIPseeqer output format.
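The way --mindistaway and --maxdistal split peaks into distal and intergenic can be sketched in a few lines of Python. This is only an illustration of the thresholds described on this page, not the tool's actual code: the function name, the "proximal" label and the handling of a distance exactly equal to a threshold are assumptions.

```python
def classify_by_distance(distance_bp, mindistaway=2000, maxdistal=50000):
    """Label a peak by its distance (in bp) to the nearest transcript.

    Hypothetical helper. The defaults mirror the documented defaults of
    --mindistaway and --maxdistal; how the tool treats a distance exactly
    equal to a threshold is not documented, so that choice is an assumption.
    """
    if distance_bp > maxdistal:
        return "intergenic"   # farther than --maxdistal from any transcript
    if distance_bp > mindistaway:
        return "distal"       # between --mindistaway and --maxdistal
    return "proximal"         # assumed label for peaks closer than --mindistaway


for d in (500, 10000, 60000):
    print(d, classify_by_distance(d))
```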
2. See the results. The main outputs of this process are three files with the extensions:
_ALL.GP, .GP and .GP.stats
- The files that end with _ALL.* will look like this:
chrY 2867287 2867611 0
chrY 2871627 2871971 2 chrY-2871327-2871629 chrY-2871816-2872114
chrY 2944779 2944956 1 chrY-2944529-2945113
chrY 5642923 5643407 2 chrY-5639836-5643224 chrY-5643229-5643472
chrY 6905840 6906263 0
chrY 6917898 6918357 1 chrY-6918198-6918315
chrY 6945356 6945877 0
chrY 7267607 7267819 1 chrY-7267753-7267949
chrY 7381389 7381767 2 chrY-7381323-7381478 chrY-7381493-7381672
chrY 7652223 7652533 0
chrY 7659894 7660062 1 chrY-7659990-7660187
Each row represents a detected ChIPseeqer peak, whereas the columns indicate:
Chromosome Start_Position End_Position Number_of_peaks_found [peaks_found]
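If you want to post-process these files yourself, a row with this column layout can be read as follows. This is a minimal sketch, assuming the columns are separated by whitespace (tabs or spaces); PeakRow and parse_row are hypothetical names, not part of ChIPseeqer.

```python
from typing import List, NamedTuple


class PeakRow(NamedTuple):
    chrom: str
    start: int
    end: int
    count: int        # Number_of_peaks_found
    hits: List[str]   # the [peaks_found] entries, possibly empty


def parse_row(line: str) -> PeakRow:
    """Parse one row: chromosome, start, end, count, then the listed entries."""
    fields = line.split()  # assumption: whitespace-delimited columns
    if len(fields) < 4:
        raise ValueError("expected at least 4 columns: %r" % line)
    count = int(fields[3])
    hits = fields[4:]
    if len(hits) != count:
        raise ValueError("count column does not match number of entries: %r" % line)
    return PeakRow(fields[0], int(fields[1]), int(fields[2]), count, hits)


row = parse_row("chrY 2871627 2871971 2 chrY-2871327-2871629 chrY-2871816-2872114")
print(row.chrom, row.start, row.end, row.count, row.hits)
```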
- The file that ends with .GP will look like this:
chrY 2871627 2871971 2 chrY-2871327-2871629 chrY-2871816-2872114
chrY 2944779 2944956 1 chrY-2944529-2945113
chrY 5642923 5643407 2 chrY-5639836-5643224 chrY-5643229-5643472
chrY 6917898 6918357 1 chrY-6918198-6918315
chrY 7267607 7267819 1 chrY-7267753-7267949
chrY 7381389 7381767 2 chrY-7381323-7381478 chrY-7381493-7381672
chrY 7659894 7660062 1 chrY-7659990-7660187
This file is a filtered version of the previous one: in the example, only the rows whose fourth column is greater than 0 are kept.
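Comparing the two examples suggests that the .GP file keeps the rows of the _ALL file that have at least one entry in the fourth column. The sketch below reproduces that filter; the rule is inferred from the example rows on this page rather than taken from the tool's code, and keep_overlapping is a hypothetical name.

```python
def keep_overlapping(lines):
    """Keep rows whose 4th column (number of entries found) is greater than 0."""
    kept = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed rows; keep rows with a positive count.
        if len(fields) >= 4 and int(fields[3]) > 0:
            kept.append(line)
    return kept


all_rows = [
    "chrY 2867287 2867611 0",
    "chrY 2871627 2871971 2 chrY-2871327-2871629 chrY-2871816-2872114",
    "chrY 2944779 2944956 1 chrY-2944529-2945113",
]
filtered = keep_overlapping(all_rows)
for line in filtered:
    print(line)
```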
The GeneParts output files include information about the gene parts that overlap with the peaks: I for introns, E for exons and P for promoters. For example:
chrY 2867287 2867611 3 chrY-2863487-2881949:I-NM_001145276-1 chrY-2863810-2881949:I-NM_003411-1 chrY-2863810-2889114:I-NM_001145275-1
chrY 2871627 2871971 3 chrY-2863487-2881949:I-NM_001145276-1 chrY-2863810-2881949:I-NM_003411-1 chrY-2863810-2889114:I-NM_001145275-1
chrY 5642923 5643407 1 chrY-5509839-5665312:I-NM_032973-4
chrY 6917898 6918357 3 chrY-6840213-6923844:I-NM_134258-2 chrY-6906284-6923844:I-NM_033284-3 chrY-6906284-6949489:I-NM_134259-3
chrY 6945356 6945877 3 chrY-6906284-6949489:I-NM_134259-3 chrY-6923939-6949489:I-NM_033284-4 chrY-6923939-6949489:I-NM_134258-3
chrY 7267607 7267819 1 chrY-7254210-7269155:I-NM_002760-3
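Each gene-part entry in the example has the shape location:TYPE-transcript-suffix, for instance chrY-2863487-2881949:I-NM_001145276-1. A minimal sketch of how such an entry could be split is shown below. The function name is hypothetical, and the meaning of the trailing suffix is not stated on this page, so it is kept as an uninterpreted string.

```python
PART_NAMES = {"I": "intron", "E": "exon", "P": "promoter"}


def parse_genepart(entry):
    """Split 'chr-start-end:TYPE-transcript-suffix' into its components."""
    location, annotation = entry.split(":", 1)
    # rsplit from the right so the coordinates are always the last two fields.
    chrom, start, end = location.rsplit("-", 2)
    part_code, rest = annotation.split("-", 1)
    # Transcript IDs such as NM_001145276 contain no hyphen in the examples;
    # the text after the last hyphen is kept as-is (its meaning is an open question).
    transcript, suffix = rest.rsplit("-", 1)
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "part": PART_NAMES.get(part_code, part_code),
        "transcript": transcript,
        "suffix": suffix,
    }


parsed = parse_genepart("chrY-2863487-2881949:I-NM_001145276-1")
print(parsed)
```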
- The files that end with .stats summarize statistical information. For the GeneParts option the .stats file will look like this:
Number of peaks: 18814
Number of peaks that overlap with gene parts: 12364 (%0.657170192409907)
Number of peaks that do not overlap with gene parts: 6450 (%0.342829807590092)
Number of peaks that overlap with PROMOTERS only: 1115 (%0.0592643775911555)
Number of peaks that overlap with EXONS only: 239 (%0.0127033060486871)
Number of peaks that overlap with INTRONS only: 7763 (%0.412618262995642)
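Note that the values printed after the % sign in this example are fractions between 0 and 1, not percentages: 12364 / 18814 is about 0.657, i.e. 65.7% of the peaks. The following sketch reads one line of the .stats file and checks this; it assumes one statistic per line in the "label: count (%fraction)" form shown in the example, and parse_stats_line is a hypothetical name.

```python
import re

# label, then ':', then an integer count, then an optional '(%fraction)'.
STATS_LINE = re.compile(
    r"^(?P<label>.+?):\s*(?P<count>\d+)(?:\s*\(%(?P<fraction>[0-9.]+)\))?\s*$"
)


def parse_stats_line(line):
    """Return (label, count, fraction or None), or None if the line does not match."""
    match = STATS_LINE.match(line.strip())
    if match is None:
        return None
    fraction = match.group("fraction")
    return (
        match.group("label"),
        int(match.group("count")),
        float(fraction) if fraction is not None else None,
    )


_, total, _ = parse_stats_line("Number of peaks: 18814")
label, count, fraction = parse_stats_line(
    "Number of peaks that overlap with gene parts: 12364 (%0.657170192409907)"
)
# The reported value matches count / total, so it is a fraction, not a percentage.
print(label, count, round(100 * count / total, 1), "percent")
```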
The GeneParts analysis also produces .promoters, .exons and .introns files that contain the peaks overlapping with each of these entities.
- The .frac file also shows the percentage of peaks in each category, and the percentage of each category in the genome.
fraction_of_peaks fraction_of_genome
Promoters 18.6 4.32
Downstream 1.5 3.21
Exons 2.1 0.84
Introns 44.2 35.73
Distal 20.2 16.24
Intergenic 13.4 39.66
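One way to read this table is to divide the share of peaks in a category by the share of the genome that the category covers: a ratio above 1 means the category holds more peaks than its size alone would suggest. The sketch below computes that ratio from the example values. The ratio is an illustration added here, not something ChIPseeqerAnnotate reports, and the parsing assumes whitespace-separated columns with a one-line header.

```python
FRAC_TEXT = """fraction_of_peaks fraction_of_genome
Promoters 18.6 4.32
Downstream 1.5 3.21
Exons 2.1 0.84
Introns 44.2 35.73
Distal 20.2 16.24
Intergenic 13.4 39.66
"""


def peak_to_genome_ratios(text):
    """Return {category: fraction_of_peaks / fraction_of_genome}."""
    ratios = {}
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        category, peaks, genome = fields[0], float(fields[1]), float(fields[2])
        if genome > 0:
            ratios[category] = peaks / genome
    return ratios


ratios = peak_to_genome_ratios(FRAC_TEXT)
for category, ratio in ratios.items():
    print(category, round(ratio, 2))
```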
3. What can I do next?
You can run the mergeCSAnnotateGenesColumns program to extract specific columns from the .genes.annotated.txt file, and retrieve the genes with peaks in their promoters/exons/introns etc.
On these gene lists you can then perform pathway analysis. See make_PAGE_input.
