Back to Elementolab/
This script offers an integrated analysis of gene expression (RNA-seq) and TF binding (ChIP-seq) in terms of predictive modeling.
In other words, this analysis constructs predictive models of gene expression profile based on several ChIP-seq TF data.
The steps of the analysis include:
1. feature extraction from ChIP-seq data
2. dimensionality reduction using Principal Component Analysis
3. linear regression modeling of principal components
Background and Description
The script takes as input a file (coreTFFile) with the names of peakfiles (i.e., typical output of ChIPseeqer) followed by a label (see coreTFFile example that follows) and a file (rpkmFile) with RPKM values, and estimates the R-square of the model (using as response of the model the RPKM values given).
The R-square value shows the prediction power. In other words, if R-square is 0.912, it means that 91.2 of the total response variation (e.g., LY1 gene expression) is due to the linear association between the variables (e.g., the TFs).
If the rpkmFile has many columns each indicating a different condition (e.g., CB, NB, LY1), the response of the model can be specifically defined (see below --resp option).
What else can I do?
You can add another file (suppTFFile) of the same format with coreTFFile that includes the peakfiles of other TFs (from other labs or a repository), and run the same analysis in two different ways:
1. --suppTFMode=all This mode "adds" all the supp TFs to your core TFs and runs the analysis. In the end you can see how much the R-square is affected when adding many more TFs.
2. --suppTFMode=onebyone This mode "adds" one supp TF at a time to your core TFs and runs the analysis. In the end you can see how much the R-square is affected when adding one new TF at a time.
STEP 1: From each of the TF peakfiles included in the coreTFFile, we estimate for each transcript(gene) a score that is a function of a peak's distance to the TSS of the transctipt and a peak's height. We combine all these scores, from all TFs for all trascripts.
STEP 2: The result of this integration is a vector of scores for each transcript. PCA is then performed to extract uncorrelated variables (principal components) with high variance.
STEP 3: Linear regression is the step that follows, using the TF components extracted before as predictors and the RPKM values as the response. With the regression we can estimate how well the predictors capture the predictable variations in the gene expression.
You must download the BCELLS dir from svn. Here's how you can do that:
svn checkout --username guest --password firstname.lastname@example.org https://pbtech-vc.med.cornell.edu/public/svn/elementolab/BCELLS/trunk BCELLS/ svn checkout --username guest --password email@example.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/BIO-C/trunk BIO-C
You must download the PERL_MODULES dir from svn. Here's how you can do that:
svn checkout --username guest --password firstname.lastname@example.org https://pbtech-vc.med.cornell.edu/public/svn/elementolab/PERL_MODULES/trunk PERL_MODULES/
You must download the PERL_SCRIPTS dir from svn. Here's how you can do that:
svn checkout --username guest --password email@example.com https://pbtech-vc.med.cornell.edu/public/svn/elementolab/PERL_SCRIPTS/trunk PERL_SCRIPTS/
You must download the HEATMAP dir from svn. Here's how you can do that:
svn checkout --username guest --password firstname.lastname@example.org https://pbtech-vc.med.cornell.edu/public/svn/elementolab/HEATMAP/trunk HEATMAP/
You will also need R. To download the R dir from svn you can do:
svn checkout https://svn.r-project.org/R/trunk/ R_HOME
You must have set the $CHIPSEEQERDIR environmental variable and add it to the PATH. How to do that?
You must have set the $CHIPPCADIR environmental variable (make it to point to BCELLS directory) and add it to the PATH. How to do that?
You must have set the $PERLSCRIPTSDIR environmental variable (make it to point to PERL_SCRIPTS directory) and add it to the PATH. How to do that?
You must have set the $HEATMAPDIR environmental variable (make it to point to HEATMAP directory) and add it to the PATH. How to do that?
You also need R, and the GNU scientific library: must also add the LD_LIBRARY to the PATH
How to run the script
To run the script directly from any folder, you need to add the $CHIPSEEQERDIR and $CHIPSEEQERDIR/SCRIPTS to your $PATH. Read How to set the CHIPSEEQERDIR variable.
1. To run the program, type the command:
ChIPseeqerModel --coreTFFile=Core_TF_files_labels.txt --calcScores=0 --rpkmFile=LY1_LY7_CB_NB_RPKM.txt --suppTFFile=all_files.txt \ --suppTFMode=onebyone --anova=1 --peakheightcol=4
The following options are available:
-coreTFFile FILE This file contains the names and labels for all the TFs under study. An example of such a file follows. -suppTFFile FILE This file contains the names and labels for all the supplementary TFs. An example of such a file follows. -d0 INT Constant that defines how close the peak is to the TSS -peakheightcol INT Indicates the number of column in the peakfile that has the peak height (the peak height is used to estimate the score) -calcScores INT Either 0 or 1. If set to 0 skips the part of estimating the scores for the peakfiles in coreTFFile (this means that you already have the .PB files created from this step). Default is set to 1. -pcaCenter INT -pcaScale INT -rpkmFile FILE This file contains the gene expression values (e.g., RPKM). -response STR Indicates the name of the column from the rpkmFile that will be the response of the model. For example, "log\"(LY1+1)\"" -suppTFMode STR Can be either "all" or "onebyone". Default is "all". -skipPol2 INT Either 0 or 1. If set to 1 skips all peakfils from suppTFFile that contain the "Pol2" annotation. Default is set to 1. -anova INT Either 0 or 1. If set to 1 performs anova and reports id the differences in R-square are significant. The Bonferroni correction is also performed in this step. Default is set to 0.
Example of the coreTFFile and suppTFFile.
Bcl6Ly1_peaks.txt.PB Bcl6Ly1 BCOR_peaks.txt.PB BCOR CTCF_peaks.txt.PB CTCF h3k4me1_peaks.txt.PB h3k4me1 h3k4me3_peaks.txt.PB h3k4me3
IMPORTANT: The coreTFFile and suppTFFile must be in the same directory with the peakfiles (Bcl6Ly1_peaks.txt, BCOR_peaks.txt etc).
Example of the rpkmFile (Simple format: Transcript_ID RPKM_value)
ID LY1 LY7 CB NB NM_021648 13.6 14.4 14.9 10.5 NM_020351 0.0 0.0 0.0 0.0 NM_001112734 1.4 1.6 3.6 3.4 NM_002697 21.9 28.1 12.7 7.0 NM_002656 0.1 0.0 0.7 0.8
What do the results look like?
LABEL R-SQ p-value Sign.symbol Corr.p-value CL6_K9ac 0.616 1.71E-09 *** 0.00000011 K4me1_K9Ac 0.616 7.95E-09 *** 0.00000053 K79ac_K4me1 0.616 5.55E-09 *** 0.00000037 MTA3_K79ac 0.616 9.68E-09 *** 0.00000065 BCL6_K4me3 0.6159 2.33E-08 *** 0.00000156
Can I extract the trascripts for each component?
You can use the script ChIPseeqerGetPCGenes.
- The ChIPseeqer Model tools used independently