Elementolab/ChIPseeqerCluster
ChIPseeqerCluster
In this analysis you cluster a matrix that contains the average density of genomic regions (peaks or promoters), or any other matrix in the same format.
You can use either:
- our implementation of Self-organizing Maps (SOMs), or
- hierarchical clustering, using the programs Cluster and Java TreeView.
IMPORTANT: You need to download the KOHONEN folder from the SVN repository:
svn checkout --username guest --password email@email.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/KOHONEN/trunk KOHONEN/
and set the KOHONENDIR environment variable.
The Cluster program must be downloaded from its webpage and installed on your computer. The TreeView program is included in the ChIPseeqer directory. Add the Cluster and TreeView directories to your $PATH variable.
To run the tools directly from any folder, add $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH variable. Read How to set the CHIPSEEQERDIR variable.
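As a minimal sketch, the variables mentioned above could be set in a shell startup file such as ~/.bashrc. The install locations used here ($HOME/ChIPseeqer and $HOME/KOHONEN) are assumptions, not paths given by this tutorial; substitute the directories where you actually placed each package, and add the Cluster and TreeView directories in the same way.

```shell
# Hypothetical install locations: adjust to where you checked out each package.
export CHIPSEEQERDIR="$HOME/ChIPseeqer"
export KOHONENDIR="$HOME/KOHONEN"

# Make the ChIPseeqer tools and scripts callable from any folder.
export PATH="$PATH:$CHIPSEEQERDIR:$CHIPSEEQERDIR/SCRIPTS"
```

After opening a new shell, running `echo $KOHONENDIR` should print the directory you chose.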
IMPORTANT: To run this program you need to provide a matrix similar to the one created in ChIPseeqerDensityMatrix or in ChIPseeqerGetReadDensityProfiles. The first column must be an ID (peak identifier or gene/transcript name).
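Before clustering, it can help to confirm that the matrix is well formed. The sketch below assumes a tab-delimited file whose first line is a header; that layout is an assumption, not something this tutorial states. The function name check_matrix is hypothetical and is not part of ChIPseeqer.

```shell
# check_matrix FILE
# Reports rows whose column count differs from the header's and rows with an
# empty first (ID) column. Assumes a tab-delimited file with one header line.
# Returns non-zero if any problem is found.
check_matrix() {
    awk -F'\t' '
        NR == 1     { ncols = NF; next }
        NF != ncols { printf "line %d: %d columns, expected %d\n", NR, NF, ncols; bad = 1 }
        $1 == ""    { printf "line %d: empty ID column\n", NR; bad = 1 }
        END         { exit bad + 0 }
    ' "$1"
}
```

For example, `check_matrix TF_targets_density_TSS_middle_NM.density && echo "matrix looks consistent"` prints the message only when every row passes both checks.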
1. Type the command:
ChIPseeqerCluster --file=TF_targets_density_TSS_middle_NM.density --suf=TF_targets_density_TSS_middle_NM.cluster
The following options are available:
--file=FILE file containing genomic regions
--suf=STR suffix for output files
--type=INT 0 for Cluster, 1 for SOMs
(for Cluster - hierarchical)
--distance=INT 1 for uncentered correlation, 2 for Pearson correlation, 3 for uncentered correlation absolute value,
4 for Pearson correlation absolute value, 5 for Spearman's rank correlation, 6 for Kendall's tau,
7 for Euclidean distance, 8 for City-block distance
--linkage=STR m for pairwise complete linkage, s for pairwise single linkage, c for pairwise centroid linkage, a for pairwise average linkage
(for Cluster - k-means)
--distance=INT 1 for uncentered correlation, 2 for Pearson correlation, 3 for uncentered correlation absolute value,
4 for Pearson correlation absolute value, 5 for Spearman's rank correlation, 6 for Kendall's tau,
7 for Euclidean distance, 8 for City-block distance
--k=STR number of clusters
--r=STR number of times to run k-means
(for SOMs)
--iter=INT number of iterations for SOMs
--xdim=INT the dimensions of the map (number of nodes is --xdim times --ydim). Default is 10 x 10.
--ydim=INT
With --type=0, the output of this process is a .cdt and a .gtr file. When clustering finishes, the TreeView program opens so that you can view the result as a heatmap and dendrogram.
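As an illustration of how the hierarchical options listed above might be combined, the sketch below wraps one invocation in a small function. The wrapper name run_hclust is hypothetical, and the choice of Pearson correlation with average linkage is only an example. It also assumes the input file name ends in .density, as in step 1.

```shell
# run_hclust MATRIX
# Hypothetical wrapper: hierarchical clustering (--type=0) with
# Pearson correlation (--distance=2) and pairwise average linkage (--linkage=a).
# The output suffix is the input name with ".density" replaced by ".cluster".
run_hclust() {
    matrix=$1
    ChIPseeqerCluster --file="$matrix" \
        --suf="${matrix%.density}.cluster" \
        --type=0 --distance=2 --linkage=a
}
```

Calling `run_hclust TF_targets_density_TSS_middle_NM.density` passes the same --file and --suf values as the command in step 1, with the clustering type, distance, and linkage spelled out.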
2. (When --type=1 - SOMs - )
The output file will contain the clustering results. Here's an example:
ID                        CLUSTER_ID
chr1-146822369-146824369  1
chr1-172648659-172650659  1
chr1-108226941-108228941  2
chr1-24086510-24088510    2
chr1-179319195-179321195  3
chr1-233333747-233335747  4
chrY-11938483-11940483    5
chrY-12319125-12321125    5
chrY-12109703-12111703    5
chrY-10644391-10646391    6
chrY-11928826-11930826    6
chrY-57399708-57401708    6
chrY-57380815-57382815    6
chrY-11935100-11937100    7
If you set --xdim=4 and --ydim=4, the map has 4 x 4 = 16 nodes, so there will be 16 clusters.
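To see how many regions fall into each SOM cluster, the two-column output can be summarized with awk. This is a sketch that assumes a whitespace-delimited file with one header line (ID CLUSTER_ID), as in the example above. The function name cluster_sizes is hypothetical, and the name of the output file depends on your --suf value.

```shell
# cluster_sizes FILE
# Prints "CLUSTER_ID COUNT" for each cluster, sorted by cluster ID.
# Assumes one header line followed by "ID CLUSTER_ID" rows.
cluster_sizes() {
    awk 'NR > 1 { n[$2]++ } END { for (c in n) print c, n[c] }' "$1" | sort -n
}
```

Comparing the number of lines printed with --xdim times --ydim shows how many map nodes actually received members.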
Here's the visual output of the clustering:
3. (When --type=0 - Cluster - ) To extract the transcripts in each cluster, type the command:
$ SCRIPTS/hclust2kgg.pl --cdt=TF_targets_density_TSS_middle_NM.cluster.cdt --gtr=TF_targets_density_TSS_middle_NM.cluster.gtr --clusters=8 > 8_clusters.txt
The following options are available:
--cdt=FILE file containing the density matrix information
--gtr=STR file containing the hierarchical clustering tree information
--clusters=INT number of clusters
IMPORTANT: The --cdt and --gtr options must be given the .cdt and .gtr output files of step 1.
The output of this process will look like this:
GENE          EXP
NM_005400     0
NM_001142961  1
NM_013234     1
NM_016484     1
NM_145236     2
NR_002970     2
NM_001135047  3
NM_153697     4
NM_153831     5
NM_005607     6
NM_018976     7
Each number (from 0 to 7) indicates one of the eight clusters the script has found in the data.
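To pull out the transcripts that belong to one cluster, the output can be filtered on its second column. This is a sketch that assumes the whitespace-delimited GENE/EXP layout shown above, with one header line. The function name cluster_members is hypothetical and is not part of ChIPseeqer.

```shell
# cluster_members FILE CLUSTER
# Prints the IDs (first column) of all rows assigned to the given cluster number.
# Assumes one header line followed by "GENE EXP" rows.
cluster_members() {
    awk -v c="$2" 'NR > 1 && $2 == c { print $1 }' "$1"
}
```

For example, `cluster_members 8_clusters.txt 1 > cluster1_transcripts.txt` would write the transcripts of cluster 1 to a separate file.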
