Elementolab/ChIPseeqerCluster
From Icbwiki
Revision as of 20:04, 8 December 2010 by Eug2002
ChIPseeqerCluster
In this analysis you can cluster the detected peaks based on their location in the promoter regions (2 kb upstream to 2 kb downstream) of the RefSeq genes. Hierarchical clustering is the method currently available. The script uses the programs:
- Cluster and
- Java TreeView.
The Cluster program should be downloaded from its webpage and installed on your computer. The TreeView program is included in the ChIPseeqer directory. Note that both the Cluster and TreeView directories must be included in the $PATH variable.
To run the tools directly from any folder, you need to add $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH variable. See "How to set the CHIPSEEQERDIR variable".
1. Type the command:
ChIPseeqer2DensityMatrix --targets=TF_targets.txt --lenu=2000 --lend=2000 --prefix=TF_targets_density_TSS_middle_NM --generegion=TSS --norm=0 --chipdir=CHIP
The following options are available:
--targets=FILE file containing genomic regions
--lenu=INT length upstream of genomic region (TSS or TES)
--lend=INT length downstream of genomic region (TSS or TES)
--generegion=STR can be TSS (transcription start site) or TES (transcription end site)
--norm=INT set to 1 to normalize the matrix
--prefix=STR prefix for output files
--genome=STR can be hg18 (human), mm9 (mouse), dm3 (drosophila), or sacser (for Saccharomyces cerevisiae)
--db=STR can be RefGene (available for hg18, mm9, dm3), AceView (for hg18, mm9), Ensembl (for hg18, mm9, dm3), or UCSCGenes (for hg18, mm9). Default is RefGene.
IMPORTANT: The --targets option must be given the ChIPseeqer output file.
The output of this process is a .density file. Each line corresponds to a RefSeq transcript. For every 10 nucleotides in the region from 2000 bp upstream to 2000 bp downstream, the average number of reads is computed. Thus, the .density file will look like this:
NM_007125 3.8 4.0 4.0 3.8 3.0 3.0 2.6 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
NM_004202 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 1.0 1.0 1.0 1.0 1.0
NM_001005852 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 2.6 1.0
NM_001146706 4.0 4.0 4.0 4.0 4.0 3.8 3.0 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 2.0 2.0
The number of columns per line will be (lend+lenu)/10 + 1. With the values used in the command above, that is (2000+2000)/10 + 1 = 401.
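A minimal sketch of how a .density file could be read and checked against the column formula. It assumes the file is whitespace-separated with the transcript ID first, as in the example; the function names read_density and expected_columns are hypothetical and not part of ChIPseeqer. The text does not say whether the column count includes the ID column, so the sketch only computes the formula and leaves that comparison to the reader.

```python
import io


def read_density(handle):
    """Parse density lines: a transcript ID followed by per-bin read averages.

    Assumes whitespace-separated fields (an assumption, not documented).
    """
    rows = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows[fields[0]] = [float(x) for x in fields[1:]]
    return rows


def expected_columns(lenu, lend, bin_size=10):
    """Column count given in the text: (lend + lenu) / 10 + 1."""
    return (lend + lenu) // bin_size + 1


# Toy input: the first four values of two rows from the example,
# not a full-width file.
sample = io.StringIO(
    "NM_007125 3.8 4.0 4.0 3.8\n"
    "NM_004202 0.0 0.0 0.0 0.0\n"
)
density = read_density(sample)
print(len(density), "transcripts read")
print("columns expected for lenu=lend=2000:", expected_columns(2000, 2000))
```

For a real file, replace the StringIO object with open("TF_targets_density_TSS_middle_NM.density").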
IMPORTANT: When the --norm=1 option is used, a .norm_density file is also created; it contains the normalized values (i.e., between 0 and 1).
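The text only states that normalized values lie between 0 and 1; it does not say how they are computed. As an illustration of one common way to get values in that range, the sketch below applies min-max scaling to a single row. Both the choice of min-max scaling and the per-row scope are assumptions, and min_max_normalize is a hypothetical helper, not the ChIPseeqer implementation.

```python
def min_max_normalize(values):
    """Rescale a row to [0, 1] with min-max scaling.

    ASSUMPTION: ChIPseeqer's actual normalization is not documented here;
    this is only one way to obtain values between 0 and 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat row carries no shape information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


row = [2.0, 4.0, 3.0]
print(min_max_normalize(row))
```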
2. Type the command:
ChIPseeqerCluster --densityfile=TF_targets_density_TSS_middle_NM.density --suf=TF_targets_density_TSS_middle_NM.cluster --distance=2 --linkage=a
The following options are available:
--densityfile=FILE file containing the density matrix (the .density file from step 1)
--suf=STR suffix for output files
--distance=INT 1 for uncentered correlation, 2 for Pearson correlation, 3 for uncentered correlation absolute value,
4 for Pearson correlation absolute value, 5 for Spearman's rank correlation, 6 for Kendall's tau,
7 for Euclidean distance, 8 for City-block distance
--linkage=STR m for pairwise complete linkage, s for pairwise single linkage, c for pairwise centroid linkage, a for pairwise average linkage
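The option codes listed above are easy to mistype, so a small wrapper that validates them before running the tool may help. The sketch below only uses the flags and codes listed in this section; build_cluster_command is a hypothetical helper, and its default values (distance 2, linkage a) are taken from the example command, not from any documented defaults.

```python
# Codes as listed in the options above.
DISTANCES = {
    1: "uncentered correlation",
    2: "Pearson correlation",
    3: "uncentered correlation, absolute value",
    4: "Pearson correlation, absolute value",
    5: "Spearman's rank correlation",
    6: "Kendall's tau",
    7: "Euclidean distance",
    8: "City-block distance",
}
LINKAGES = {
    "m": "pairwise complete linkage",
    "s": "pairwise single linkage",
    "c": "pairwise centroid linkage",
    "a": "pairwise average linkage",
}


def build_cluster_command(densityfile, suf, distance=2, linkage="a"):
    """Return the ChIPseeqerCluster call as an argument list (hypothetical helper)."""
    if distance not in DISTANCES:
        raise ValueError(f"distance must be one of {sorted(DISTANCES)}")
    if linkage not in LINKAGES:
        raise ValueError(f"linkage must be one of {sorted(LINKAGES)}")
    return [
        "ChIPseeqerCluster",
        f"--densityfile={densityfile}",
        f"--suf={suf}",
        f"--distance={distance}",
        f"--linkage={linkage}",
    ]


cmd = build_cluster_command(
    "TF_targets_density_TSS_middle_NM.density",
    "TF_targets_density_TSS_middle_NM.cluster",
)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once ChIPseeqerCluster is on the $PATH.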
The output of this process is a .cdt file and a .gtr file. When clustering finishes, the TreeView program opens so that you can view the result as a heatmap with a dendrogram.
3. To extract the transcripts in each cluster, type the command:
$ SCRIPTS/hclust2kgg.pl --cdt=TF_targets_density_TSS_middle_NM.cluster.cdt --gtr=TF_targets_density_TSS_middle_NM.cluster.gtr --clusters=8 > 8_clusters.txt
The following options are available:
--cdt=FILE file containing the density matrix information
--gtr=STR file containing the hierarchical clustering tree information
--clusters=INT number of clusters
IMPORTANT: The --cdt and --gtr options must be given the output files of step 2.
The output of this process will look like this:
GENE EXP
NM_005400 0
NM_001142961 1
NM_013234 1
NM_016484 1
NM_145236 2
NR_002970 2
NM_001135047 3
NM_153697 4
NM_153831 5
NM_005607 6
NM_018976 7
Each number (from 0 to 7) indicates one of the eight clusters the script has found in the data.
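To work with the clusters downstream, it can be convenient to group the transcripts by cluster number. The sketch below assumes the output has a "GENE EXP" header line followed by one whitespace-separated transcript/cluster pair per line; read_clusters is a hypothetical helper, not part of ChIPseeqer.

```python
import io
from collections import defaultdict


def read_clusters(handle):
    """Group transcript IDs by cluster number.

    Assumes a 'GENE EXP' header and one 'ID cluster' pair per line.
    """
    clusters = defaultdict(list)
    for line in handle:
        fields = line.split()
        if len(fields) != 2 or fields == ["GENE", "EXP"]:
            continue  # skip the header and malformed or blank lines
        transcript, cluster = fields
        clusters[int(cluster)].append(transcript)
    return dict(clusters)


# Toy input: the first rows of the example output.
sample = io.StringIO(
    "GENE EXP\n"
    "NM_005400 0\n"
    "NM_001142961 1\n"
    "NM_013234 1\n"
)
clusters = read_clusters(sample)
for cluster_id in sorted(clusters):
    members = clusters[cluster_id]
    print(cluster_id, len(members), ",".join(members))
```

For a real run, replace the StringIO object with open("8_clusters.txt").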
