Elementolab/ChIPseeqerModel

From Icbwiki

Jump to: navigation, search

Back to Elementolab/

Contents

ChIPseeqer Model

This script offers an integrated analysis of gene expression (RNA-seq) and TF binding (ChIP-seq) in terms of predictive modeling.

In other words, this analysis constructs predictive models of gene expression profile based on several ChIP-seq TF data.

The steps of the analysis include:

1. feature extraction from ChIP-seq data

2. dimensionality reduction using Principal Component Analysis

3. linear regression modeling of principal components

Background and Description

The script takes as input a file (coreTFFile) with the names of peakfiles (i.e., typical output of ChIPseeqer) followed by a label (see coreTFFile example that follows) and a file (rpkmFile) with RPKM values, and estimates the R-square of the model (using as response of the model the RPKM values given).

The R-square value shows the prediction power. In other words, if R-square is 0.912, it means that 91.2 of the total response variation (e.g., LY1 gene expression) is due to the linear association between the variables (e.g., the TFs).

If the rpkmFile has many columns each indicating a different condition (e.g., CB, NB, LY1), the response of the model can be specifically defined (see below --resp option).

What else can I do?

You can add another file (suppTFFile) of the same format with coreTFFile that includes the peakfiles of other TFs (from other labs or a repository), and run the same analysis in two different ways:

1. --suppTFMode=all This mode "adds" all the supp TFs to your core TFs and runs the analysis. In the end you can see how much the R-square is affected when adding many more TFs.

2. --suppTFMode=onebyone This mode "adds" one supp TF at a time to your core TFs and runs the analysis. In the end you can see how much the R-square is affected when adding one new TF at a time.


STEP 1: From each of the TF peakfiles included in the coreTFFile, we estimate for each transcript(gene) a score that is a function of a peak's distance to the TSS of the transctipt and a peak's height. We combine all these scores, from all TFs for all trascripts.

STEP 2: The result of this integration is a vector of scores for each transcript. PCA is then performed to extract uncorrelated variables (principal components) with high variance.

STEP 3: Linear regression is the step that follows, using the TF components extracted before as predictors and the RPKM values as the response. With the regression we can estimate how well the predictors capture the predictable variations in the gene expression.

Requirements

You must download the BCELLS dir from svn. Here's how you can do that:

svn checkout --username guest --password email@email.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/BCELLS/trunk BCELLS/
svn checkout --username guest --password email@email.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/BIO-C/trunk BIO-C

You must download the PERL_MODULES dir from svn. Here's how you can do that:

svn checkout --username guest --password email@email.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/PERL_MODULES/trunk PERL_MODULES/

You must download the PERL_SCRIPTS dir from svn. Here's how you can do that:

svn checkout --username guest --password email@email.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/PERL_SCRIPTS/trunk PERL_SCRIPTS/

You must download the HEATMAP dir from svn. Here's how you can do that:

svn checkout --username guest --password email@email.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/HEATMAP/trunk HEATMAP/

You will also need R. To download the R dir from svn you can do:

svn checkout https://svn.r-project.org/R/trunk/ R_HOME

You must have set the $CHIPSEEQERDIR environmental variable and add it to the PATH. How to do that?

You must have set the $CHIPPCADIR environmental variable (make it to point to BCELLS directory) and add it to the PATH. How to do that?

You must have set the $PERLSCRIPTSDIR environmental variable (make it to point to PERL_SCRIPTS directory) and add it to the PATH. How to do that?

You must have set the $HEATMAPDIR environmental variable (make it to point to HEATMAP directory) and add it to the PATH. How to do that?

You also need R, and the GNU scientific library: must also add the LD_LIBRARY to the PATH

How to run the script

To run the script directly from any folder, you need to add the $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH. Read How to set the CHIPSEEQERDIR variable.

1. To run the program, type the command:

ChIPseeqerModel --coreTFFile=Core_TF_files_labels.txt --calcScores=0 --rpkmFile=LY1_LY7_CB_NB_RPKM.txt --suppTFFile=all_files.txt \
  --suppTFMode=onebyone --anova=1 --peakheightcol=4

The following options are available:

-coreTFFile FILE       This file contains the names and labels for all the TFs under study. An example of such a file follows.
-suppTFFile FILE       This file contains the names and labels for all the supplementary TFs. An example of such a file follows.
-d0 INT                Constant that defines how close the peak is to the TSS
-peakheightcol INT     Indicates the number of column in the peakfile that has the peak height (the peak height is used to estimate the score)
-calcScores INT        Either 0 or 1. If set to 0 skips the part of estimating the scores for the peakfiles in coreTFFile (this means that you already 
                       have the .PB files created from this step). Default is set to 1. 
-pcaCenter INT         
-pcaScale INT
-rpkmFile FILE         This file contains the gene expression values (e.g., RPKM). 
-response STR          Indicates the name of the column from the rpkmFile that will be the response of the model. For example, "log\"(LY1+1)\""
-suppTFMode STR        Can be either "all" or "onebyone". Default is "all". 
-skipPol2 INT          Either 0 or 1. If set to 1 skips all peakfils from suppTFFile that contain the "Pol2" annotation. Default is set to 1.
-anova INT             Either 0 or 1. If set to 1 performs anova and reports id the differences in R-square are significant. 
                       The Bonferroni correction is also performed in this step. Default is set to 0.

Example of the coreTFFile and suppTFFile.

Bcl6Ly1_peaks.txt.PB	Bcl6Ly1
BCOR_peaks.txt.PB	BCOR
CTCF_peaks.txt.PB	CTCF
h3k4me1_peaks.txt.PB	h3k4me1
h3k4me3_peaks.txt.PB	h3k4me3

IMPORTANT: The coreTFFile and suppTFFile must be in the same directory with the peakfiles (Bcl6Ly1_peaks.txt, BCOR_peaks.txt etc).

Example of the rpkmFile (Simple format: Transcript_ID RPKM_value)

ID	LY1	LY7	CB	NB
NM_021648	13.6	14.4	14.9	10.5
NM_020351	0.0	0.0	0.0	0.0
NM_001112734	1.4	1.6	3.6	3.4
NM_002697	21.9	28.1	12.7	7.0
NM_002656	0.1	0.0	0.7	0.8

What do the results look like?

LABEL	        R-SQ	p-value	   Sign.symbol	Corr.p-value
CL6_K9ac	0.616	1.71E-09	***	0.00000011
K4me1_K9Ac	0.616	7.95E-09	***	0.00000053
K79ac_K4me1	0.616	5.55E-09	***	0.00000037
MTA3_K79ac	0.616	9.68E-09	***	0.00000065
BCL6_K4me3	0.6159	2.33E-08	***	0.00000156

Can I extract the trascripts for each component?

You can use the script ChIPseeqerGetPCGenes.

  • The ChIPseeqer Model tools used independently
Personal tools