Elementolab/ChIPseeqerCluster

ChIPseeqerCluster

This analysis clusters a matrix that contains either the average density of genomic regions (peaks or promoters) or any other data in the same matrix format.

You can use either the Cluster program (hierarchical or k-means clustering) or self-organizing maps (SOMs).

IMPORTANT: You need to download the KOHONEN folder from the svn

 svn checkout --username guest --password email@email.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/KOHONEN/trunk KOHONEN/

and set the KOHONENDIR environment variable to point to it.
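
As a minimal sketch, the variable can be set in a Bourne-style shell as shown here. The path `$HOME/KOHONEN` is only an assumption for illustration; use the directory where you actually ran the svn checkout.

```shell
# Hypothetical location: replace with the directory created by the svn checkout.
KOHONEN_CHECKOUT="$HOME/KOHONEN"

# Export the variable so that programs started from this shell can read it.
export KOHONENDIR="$KOHONEN_CHECKOUT"

# Warn early if the path does not exist (for example, if the checkout is elsewhere).
if [ ! -d "$KOHONENDIR" ]; then
    echo "warning: $KOHONENDIR is not a directory; adjust the path" >&2
fi
```

To keep the setting across sessions, the export line can be added to your shell startup file (for example `~/.bashrc` for bash).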

The Cluster program should be downloaded from its webpage and installed on your computer. The TreeView program is included in the ChIPseeqer directory. Add the Cluster and TreeView directories to your $PATH variable.

To run the tools directly from any folder, add $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH variable. See "How to set the CHIPSEEQERDIR variable".
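
For illustration, the PATH change might look like the following in a Bourne-style shell. The install location `$HOME/ChIPseeqer` is a hypothetical placeholder, not a path taken from this tutorial.

```shell
# Hypothetical install location: replace with your own ChIPseeqer directory.
export CHIPSEEQERDIR="$HOME/ChIPseeqer"

# Append the main directory and its SCRIPTS subdirectory to the search path.
export PATH="$PATH:$CHIPSEEQERDIR:$CHIPSEEQERDIR/SCRIPTS"
```

The Cluster and TreeView directories can be appended to `PATH` in the same way, using whatever locations they were installed to.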

IMPORTANT: To run this program you need to provide a matrix similar to the one created in ChIPseeqerDensityMatrix or in ChIPseeqerGetReadDensityProfiles. The first column must be an ID (peak identifier or gene/transcript name).
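
Before clustering, it can help to check the matrix quickly. The sketch below is not part of ChIPseeqer; `check_matrix` is a hypothetical helper, and it assumes the matrix is tab-delimited with a single header row. It reports rows with an empty ID or a column count that differs from the header.

```shell
# check_matrix FILE
# Assumption: tab-delimited matrix whose first line is a header row.
# Exits non-zero if any data row has an empty ID or the wrong number of columns.
check_matrix() {
    awk -F'\t' '
        NR == 1    { ncol = NF; next }   # remember the header width
        $1 == ""   { printf "line %d: empty ID\n", NR; bad = 1 }
        NF != ncol { printf "line %d: %d columns, expected %d\n", NR, NF, ncol; bad = 1 }
        END        { exit bad + 0 }
    ' "$1"
}

# Small demonstration on a made-up matrix whose last row is one column short.
demo=$(mktemp)
printf 'ID\tbin1\tbin2\nNM_005400\t0.5\t1.2\nNM_013234\t0.1\n' > "$demo"
check_matrix "$demo" || echo "matrix needs fixing"
rm -f "$demo"
```

On the demonstration data this prints `line 3: 2 columns, expected 3` followed by `matrix needs fixing`.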

1. Type the command:

ChIPseeqerCluster --file=TF_targets_density_TSS_middle_NM.density --suf=TF_targets_density_TSS_middle_NM.cluster

The following options are available:

--file=FILE        file containing genomic regions
--suf=STR          suffix for output files
--type=INT         0 for Cluster, 1 for SOMs
(for Cluster - hierarchical)
--distance=INT     1 for uncentered correlation, 2 for Pearson correlation, 3 for uncentered correlation absolute value, 
                   4 for Pearson correlation absolute value, 5 for Spearman's rank correlation, 6 for Kendall's tau, 
                   7 for Euclidean distance, 8 for City-block distance 
--linkage=STR      m for pairwise complete linkage, s for pairwise single linkage, c for pairwise centroid linkage, a for pairwise average linkage
(for Cluster - k-means)
--distance=INT     1 for uncentered correlation, 2 for Pearson correlation, 3 for uncentered correlation absolute value, 
                   4 for Pearson correlation absolute value, 5 for Spearman's rank correlation, 6 for Kendall's tau, 
                   7 for Euclidean distance, 8 for City-block distance 
--k=STR            number of clusters
--r=STR            number of times to run k-means
(for SOMs)
--iter=INT         number of iterations for SOMs
--xdim=INT         x dimension of the map
--ydim=INT         y dimension of the map (number of nodes is --xdim times --ydim). Default is 10 x 10.
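
As a sketch of how the options above might be combined for a SOM run, the snippet below builds a command line from the flags in the list. The `.som` suffix and the value `--iter=1000` are arbitrary choices for illustration, not recommended settings, and the command is only executed if ChIPseeqerCluster is on your PATH and the input file exists.

```shell
# Input matrix from step 1 of this tutorial.
matrix=TF_targets_density_TSS_middle_NM.density

# Strip the .density extension and add a hypothetical .som suffix for the output.
suffix="${matrix%.density}.som"

# --type=1 selects SOMs; a 4 x 4 map gives 16 nodes. --iter=1000 is an arbitrary example value.
cmd="ChIPseeqerCluster --file=$matrix --suf=$suffix --type=1 --xdim=4 --ydim=4 --iter=1000"
echo "$cmd"

# Run only when the tool and the input file are available.
# (Unquoted $cmd relies on word splitting, which is safe here because no argument contains spaces.)
if command -v ChIPseeqerCluster >/dev/null 2>&1 && [ -f "$matrix" ]; then
    $cmd
fi
```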

This process produces a .cdt file and a .gtr file. When it finishes, the TreeView program opens so that you can view the result as a heatmap and dendrogram.

2. (When --type=1, SOMs)

The output file will contain the clustering results. Here's an example:

ID  CLUSTER_ID
chr1-146822369-146824369	1
chr1-172648659-172650659	1
chr1-108226941-108228941	2
chr1-24086510-24088510	2
chr1-179319195-179321195	3
chr1-233333747-233335747	4
chrY-11938483-11940483	5
chrY-12319125-12321125	5
chrY-12109703-12111703	5
chrY-10644391-10646391	6
chrY-11928826-11930826	6
chrY-57399708-57401708	6
chrY-57380815-57382815	6
chrY-11935100-11937100	7

If you set --xdim=4 and --ydim=4, the map has 4 x 4 = 16 nodes, so there will be 16 clusters.
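
To see how many regions fall into each cluster, the output can be summarized with standard tools. The helper below is a sketch, not part of ChIPseeqer; `count_per_cluster` is a hypothetical name, and it assumes a header line followed by whitespace-separated `ID CLUSTER_ID` rows, as in the example above.

```shell
# count_per_cluster FILE
# Prints "CLUSTER_ID<TAB>count" for each cluster, sorted by cluster number.
# Assumption: one header line, then rows of the form "ID CLUSTER_ID".
count_per_cluster() {
    awk 'NR > 1 && NF >= 2 { n[$2]++ }
         END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$1" | sort -n
}

# Demonstration on four rows copied from the example output.
demo=$(mktemp)
printf 'ID\tCLUSTER_ID\n' > "$demo"
printf 'chr1-146822369-146824369\t1\n' >> "$demo"
printf 'chr1-172648659-172650659\t1\n' >> "$demo"
printf 'chr1-179319195-179321195\t3\n' >> "$demo"
printf 'chrY-11935100-11937100\t7\n' >> "$demo"
count_per_cluster "$demo"
rm -f "$demo"
```

For these four rows the helper reports two regions in cluster 1 and one region each in clusters 3 and 7.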

Here's the visual output of the clustering:

SOMs 2D map example

(When --type=0, Cluster) To extract the transcripts in each cluster, type the command:

$ SCRIPTS/hclust2kgg.pl --cdt=TF_targets_density_TSS_middle_NM.cluster.cdt --gtr=TF_targets_density_TSS_middle_NM.cluster.gtr --clusters=8 > 8_clusters.txt
Density heatmap example

The following options are available:

--cdt=FILE         file containing the density matrix information
--gtr=STR          file containing the hierarchical clustering tree information
--clusters=INT     number of clusters 

IMPORTANT: In the --cdt and --gtr options you must enter the .cdt and .gtr output files of step 1.

The output of this process will look like this:

GENE	EXP
NM_005400	0
NM_001142961	1
NM_013234	1
NM_016484	1
NM_145236	2
NR_002970	2
NM_001135047	3
NM_153697	4
NM_153831	5
NM_005607	6
NM_018976	7

Each number (from 0 to 7) indicates one of the eight clusters the script has found in the data.
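
To pull out the transcripts that belong to one cluster, a small filter over this file is enough. The sketch below is not part of ChIPseeqer; `ids_in_cluster` is a hypothetical helper that assumes a header line followed by whitespace-separated `GENE EXP` rows, as shown above.

```shell
# ids_in_cluster FILE CLUSTER
# Prints the IDs (first column) whose cluster number (second column) equals CLUSTER.
# Assumption: one header line, then rows of the form "GENE EXP".
ids_in_cluster() {
    awk -v want="$2" 'NR > 1 && $2 == want { print $1 }' "$1"
}

# Demonstration on four rows copied from the example output.
demo=$(mktemp)
printf 'GENE\tEXP\n' > "$demo"
printf 'NM_005400\t0\n' >> "$demo"
printf 'NM_001142961\t1\n' >> "$demo"
printf 'NM_013234\t1\n' >> "$demo"
printf 'NM_145236\t2\n' >> "$demo"
ids_in_cluster "$demo" 1
rm -f "$demo"
```

With the file from the command above, `ids_in_cluster 8_clusters.txt 1 > cluster1.txt` would write the transcripts of cluster 1 to a separate file (the output file name is only an example).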
