Elementolab/ChIPseeqerCluster

From Icbwiki

Revision as of 16:07, 22 November 2010

ChIPseeqerCluster

In this analysis you can cluster the detected peaks based on their location in the promoter regions (2 kb upstream and 2 kb downstream) of the RefSeq genes. Hierarchical clustering is currently the only method available. The script uses two programs, Cluster and TreeView.

The Cluster program must be downloaded from its webpage and installed on your computer. The TreeView program is included in the ChIPseeqer directory. Note that both the Cluster and TreeView directories must be included in your $PATH variable.

To run the tools directly from any folder, add $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH variable. See "How to set the CHIPSEEQERDIR variable".
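
Before starting, it can help to confirm that the commands used on this page are actually reachable through $PATH. The following is a minimal Python sketch, not part of ChIPseeqer. It checks only the two command names that appear on this page; the Cluster and TreeView executables are left out because their names depend on how they were installed.

```python
import shutil

# Command names taken from this page. The Cluster and TreeView executables are
# not listed because their names depend on the local installation.
CHIPSEEQER_TOOLS = ["ChIPseeqer2DensityMatrix", "ChIPseeqerCluster"]


def missing_tools(tools=CHIPSEEQER_TOOLS):
    """Return the tools that cannot be found on the current $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on $PATH:", ", ".join(missing))
    else:
        print("All ChIPseeqer tools were found on $PATH.")
```

If a tool is reported as missing, revisit the $PATH settings described above.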

1. Type the command:

ChIPseeqer2DensityMatrix --targets=TF_targets.txt --lenu=2000 --lend=2000 --prefix=TF_targets_density_TSS_middle_NM --generegion=TSS --norm=0 --chipdir=CHIP

The following options are available:

--targets=FILE   file containing genomic regions
--lenu=INT       length upstream of genomic region (TSS or TES)
--lend=INT       length downstream of genomic region (TSS or TES)
--generegion=STR can be TSS (transcription start site) or TES (transcription end site)
--norm=INT       set to 1 to normalize the matrix
--prefix=STR     prefix for output files


IMPORTANT: Note that in the --targets option you must enter the ChIPseeqer output file.

The output of this process is a .density file. Each line corresponds to a RefSeq transcript. For each 10-nucleotide window in the region from 2000 bp upstream to 2000 bp downstream, the average number of reads is computed. The .density file will look like this:

NM_007125	3.8	4.0	4.0	3.8	3.0	3.0	2.6	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0
NM_004202	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.6	1.0	1.0	1.0	1.0	1.0
NM_001005852	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	3.0	2.6	1.0   
NM_001146706	4.0	4.0	4.0	4.0	4.0	3.8	3.0	3.0	3.0	3.0	3.0	4.0	4.0	4.0	4.0	2.0	2.0	
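
For downstream analysis outside TreeView, the .density file can be read with a few lines of Python. This is a sketch that assumes only what the example shows: one transcript ID followed by whitespace-separated numeric values on each line. The function names are hypothetical and not part of ChIPseeqer.

```python
def parse_density_line(line):
    """Split one .density line into (transcript_id, list of float values)."""
    fields = line.split()  # also discards trailing tabs or spaces
    return fields[0], [float(value) for value in fields[1:]]


def read_density(path):
    """Read a .density file into a dict mapping transcript ID to its values."""
    rows = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():  # skip empty lines
                transcript, values = parse_density_line(line)
                rows[transcript] = values
    return rows


# First few columns of one row, shortened for illustration.
example = "NM_004202\t0.0\t0.0\t0.6\t1.0"
transcript, values = parse_density_line(example)
print(transcript, values)
```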

The number of columns per line will be (lend+lenu)/10 + 1.
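
As a worked example of this formula, the sketch below computes the expected width for the command in step 1. The page does not say whether the count includes the transcript ID column, so the check here compares against the numeric values only; treat that as an assumption. The function names are hypothetical.

```python
def density_columns(lenu, lend, window=10):
    """Expected number of columns: (lend + lenu) / 10 + 1."""
    return (lenu + lend) // window + 1


def has_expected_width(values, lenu, lend):
    """Check a row of density values against the expected column count.

    Assumption: the count refers to the numeric values, not the ID column.
    """
    return len(values) == density_columns(lenu, lend)


# With --lenu=2000 and --lend=2000 as in step 1:
print(density_columns(2000, 2000))  # 401
```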

IMPORTANT: When the --norm=1 option is used, a .norm_density file is also created, which contains the normalized values (i.e., between 0 and 1).
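
This page does not describe how the normalization is computed. To illustrate what values "between 0 and 1" can look like, the sketch below applies per-row min-max scaling. This is one common choice and only an assumption; it is not necessarily what ChIPseeqer2DensityMatrix does.

```python
def min_max_normalize(values):
    """Scale one row of density values to the range 0..1 (min-max scaling).

    Assumption: this is an illustrative method, not ChIPseeqer's own.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A flat row has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


print(min_max_normalize([2.0, 4.0, 3.0]))  # [0.0, 1.0, 0.5]
```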

2. Type the command:

ChIPseeqerCluster --densityfile=TF_targets_density_TSS_middle_NM.density --suf=TF_targets_density_TSS_middle_NM.cluster --distance=2 --linkage=a

The following options are available:

--densityfile=FILE .density file produced in step 1
--suf=STR          suffix for output files
--distance=INT     1 for uncentered correlation, 2 for Pearson correlation, 3 for uncentered correlation absolute value, 
                   4 for Pearson correlation absolute value, 5 for Spearman's rank correlation, 6 for Kendall's tau, 
                   7 for Euclidean distance, 8 for City-block distance 
--linkage=STR      m for pairwise complete linkage, s for pairwise single linkage, c for pairwise centroid linkage, a for pairwise average linkage
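
When the command is launched from a script, the distance and linkage codes listed above can be validated before anything runs. The following Python sketch builds the argument list for ChIPseeqerCluster using only the options documented on this page. The helper name cluster_command and its defaults are hypothetical.

```python
# Codes copied from the option list above.
DISTANCES = {
    1: "uncentered correlation",
    2: "Pearson correlation",
    3: "uncentered correlation absolute value",
    4: "Pearson correlation absolute value",
    5: "Spearman's rank correlation",
    6: "Kendall's tau",
    7: "Euclidean distance",
    8: "City-block distance",
}
LINKAGES = {
    "m": "pairwise complete linkage",
    "s": "pairwise single linkage",
    "c": "pairwise centroid linkage",
    "a": "pairwise average linkage",
}


def cluster_command(densityfile, suf, distance=2, linkage="a"):
    """Return the ChIPseeqerCluster command as an argument list."""
    if distance not in DISTANCES:
        raise ValueError(f"unknown distance code: {distance!r}")
    if linkage not in LINKAGES:
        raise ValueError(f"unknown linkage code: {linkage!r}")
    return [
        "ChIPseeqerCluster",
        f"--densityfile={densityfile}",
        f"--suf={suf}",
        f"--distance={distance}",
        f"--linkage={linkage}",
    ]


command = cluster_command(
    "TF_targets_density_TSS_middle_NM.density",
    "TF_targets_density_TSS_middle_NM.cluster",
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once ChIPseeqerCluster is on $PATH.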

The output of this process is a .cdt file and a .gtr file. When clustering finishes, the TreeView program opens so that you can view the result as a heatmap and dendrogram.

3. In order to extract the transcripts per cluster you should type the command:

$ SCRIPTS/hclust2kgg.pl --cdt=TF_targets_density_TSS_middle_NM.cluster.cdt --gtr=TF_targets_density_TSS_middle_NM.cluster.gtr --clusters=8 > 8_clusters.txt

[Figure: Density heatmap example]

The following options are available:

--cdt=FILE         file containing the density matrix information
--gtr=FILE         file containing the hierarchical clustering tree information
--clusters=INT     number of clusters 

IMPORTANT: Note that in the --cdt and the --gtr options you must enter the output files of step 2.

The output of this process will look like this:

GENE	EXP
NM_005400	0
NM_001142961	1
NM_013234	1
NM_016484	1
NM_145236	2
NR_002970	2
NM_001135047	3
NM_153697	4
NM_153831	5
NM_005607	6
NM_018976	7

Each number (from 0 to 7) indicates one of the eight clusters the script has found in the data.
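
To list the transcripts that belong to each cluster, the two-column output can be grouped by cluster number. The Python sketch below assumes the format shown above: a GENE/EXP header line followed by a transcript ID and a cluster number per line. The function name is hypothetical.

```python
from collections import defaultdict


def transcripts_by_cluster(lines):
    """Group transcript IDs by cluster number, skipping the header line."""
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or fields[0] == "GENE":
            continue  # header, empty, or malformed line
        clusters[int(fields[1])].append(fields[0])
    return dict(clusters)


# A few rows from the example output above.
sample = """GENE\tEXP
NM_005400\t0
NM_001142961\t1
NM_013234\t1
NM_016484\t1
NM_145236\t2
"""
groups = transcripts_by_cluster(sample.splitlines())
for cluster, members in sorted(groups.items()):
    print(cluster, ",".join(members))
```

For a real run, pass an open file handle instead, for example transcripts_by_cluster(open("8_clusters.txt")).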
