From Icbwiki

Jump to: navigation, search

Back to Elementolab/ChIPseeqer_Tutorial


This analysis allows you to search for informative regulatory elements (motifs) among the detected ChIPseq peaks.

To run the tools directly from any folder, you need to add the $CHIPSEEQERDIR, $CHIPSEEQERDIR/SCRIPTS and $FIREDIR to your $PATH variable. Read How to set the CHIPSEEQERDIR variable.

1. The first step in this analysis is to download the entire human genome sequence, preferably from the UCSC Genome Browser website. You will also need to download the pre-packaged sequence file(s) for your organism(s).

Here is how you should do it:

1. To download the entire human genome sequence:

  • create a new directory inside ChIPseeqer/:
mkdir hg18
cd hg18
  • download genome (wget provides an easy way to do so, but you can also download the file manually and save it in hg18/):
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/chromFa.zip
  • unzip
unzip chromFa.zip
  • concatenate all chromosome files:
cat *.fa > wg.fa
  • erase chromosome files:
rm chr*.fa
  • index file
IndexFasta wg.fa

Note: This uses the samtools index function so you may skip this step if you have indexed your sequences using "samtools index".

2. Make sure that you've downloaded the species data when you installed FIRE.

If not, go to the main FIRE directory (cd $FIREDIR) and download the pre-packaged sequence file(s) for your organism:'

For example, for human:

wget https://tavazoielab.c2b2.columbia.edu/FIRE/human_data.zip

Then unzip it

unzip human_data.zip

now go back to the directory containing your ChIP-seq files.

3. The next step is to make the required FIRE input files, using the ChIPseeqer output file (e.g., TF_targets.txt) as input and run fire.

To do this, type the following command:

--peakfile=FILE 	File with ChIP-seq peaks.
--peaksfolder=DIR	Instead of peakfile, use this option to define a directory that contains several peakfiles. This will run FIRE using as many categories as the peakfiles. Useful to compare multiple peak lists motif-wise.
--fastafile=INT 	Genome file (wg.fa) - must download it, instructions at INSTALL file, section DATA.
--randmode=STR	        "adjacent" Selects immediately adjacent sequences.
                        "1MM" Shuffles the original peak sequences.
                        "CGI" Keeps the % of peaks that overlap with CpG islands. 
                        "random" Random sequences on the same chromosome.
                        Define how to select random sequences. Default is "random". 
--genome=STR    	hg19 (human)
                       	hg18 (human)
                       	mm10 (mouse)
                       	mm9 (mouse)
                       	rn4 (rat)
                       	dm3 (drosophila)
                       	sacser (Saccharomyces cerevisiae)
--species=STR   	human
                        Use this option to match the discovered motifs against a database of known motifs.
                        Need to download first the corresponding data from https://tavazoielab.c2b2.columbia.edu/FIRE/.
--seed=INT  		When set to 1, FIRE runs with default seed for random intervals. Useful if you want to keep same random regions for different FIRE runs. Default is 1.
--runit=INT		When set to 1, FIRE runs directly. If set to 0 you need to run FIRE manually. Default is 1.
--k=INT			Define the size of the motifs to be discovered (values 3-8). Default is 7.
--verbose=INT   	Verbose mode. Default is 0.

This command will create two files: TF_targets.txt.FIRE.txt and TF_targets.txt.FIRE.seq. By default, ChIPseeqerFIRE runs FIRE for you. However, if you'd like to run FIRE manually, you can use TF_targets.txt.FIRE.txt and TF_targets.txt.FIRE.seq as input (step 4).

IMPORTANT: To install FIRE see https://tavazoielab.c2b2.columbia.edu/FIRE/tutorial.html Don't forget to set the $FIREDIR variable and add it to your $PATH.

4. If you want to run FIRE independently of the ChIPseeqerFIRE script, type the following command:

fire.pl --expfile=TF_targets.txt.FIRE.txt --fastafile_dna=TF_targets.txt.FIRE.seq --nodups=1 --minr=2 --species=human --dorna=0 --dodnarna=0

The -- expfile and --fastafile_dna options must be the .txt and .seq files respectively that have been created by ChIPseeqerFIRE.

The following options are available:

-jn_t=INT                      between 0 and 10, define the robustness index threshold (default is 6)
-k=INT                         defines the length of the k-mer seeds (default is 7)
-expfiles="*.txt"              processes all .txt expression files in current directory
-sortmotifbyphase=1            Skip module discovery, sort motifs by phase (useful only for certain continuous expression profiles)
-dodna=0                       Skip the DNA combination phase (useful to redo only parts of the analysis)
-dorna=0                       Skip the RNA combination phase (useful to redo only parts of the analysis)
-dodnarna=0                    Skip the DNA/RNA combination phase (useful to redo only parts of the analysis)

IMPORTANT: You must define the --species option in order to get the motif names

4. See the results.

The previous step will create a folder with the "FIRE" extension.

Inside this folder you will find the .summary.pdf file which is the most important part of the FIRE analysis.

Example of FIRE output
Example of FIRE output

5. What can I do next ?


If your peak list is > 100,000 lines

1. Go to ChIPseeqer/tools/FIRE/PROGRAMS
Open dataio.c
Go to line 253, should be:   int   mynmax = 100000;
Change this number to a larger one, if your list of peaks is > 100000 

2. Open mi_find.c
Go to lines 466 and 468
Change 100,000 to a larger number
For example, 
count_seq_index = (char*)calloc(250000,  sizeof(char));
count_seq ++; if (count_seq == 250000) die("pb with count_seq > 250000\n");

3. Open mi_signif.c
Go to lines 301 and 304
Change 100,000 to a larger number
For example, 
count_seq_index = (char*)calloc(250000,  sizeof(char));
count_seq ++; if (count_seq == 250000) die("pb with count_seq > 250000\n");

4. Go back to ChIPseeqer/src
Run "make tools"
Personal tools