Back to Elementolab/ChIPseeqer_Tutorial
This analysis allows you to search for informative regulatory elements (motifs) among the detected ChIPseq peaks.
To run the tools directly from any folder, you need to add the $CHIPSEEQERDIR, $CHIPSEEQERDIR/SCRIPTS and $FIREDIR to your $PATH variable. Read How to set the CHIPSEEQERDIR variable.
1. The first step in this analysis is to download the entire human genome sequence, preferably from the UCSC Genome Browser website. You will also need to download the pre-packaged sequence file(s) for your organism(s).
Here is how you should do it:
1. To download the entire human genome sequence:
- create a new directory inside ChIPseeqer/:
mkdir hg18 cd hg18
- download genome (wget provides an easy way to do so, but you can also download the file manually and save it in hg18/):
- concatenate all chromosome files:
cat *.fa > wg.fa
- erase chromosome files:
- index file
Note: This uses the samtools index function so you may skip this step if you have indexed your sequences using "samtools index".
2. Make sure that you've downloaded the species data when you installed FIRE.
If not, go to the main FIRE directory (cd $FIREDIR) and download the pre-packaged sequence file(s) for your organism:'
For example, for human:
Then unzip it
now go back to the directory containing your ChIP-seq files.
3. The next step is to make the required FIRE input files, using the ChIPseeqer output file (e.g., TF_targets.txt) as input and run fire.
To do this, type the following command:
--peakfile=FILE File with ChIP-seq peaks. --peaksfolder=DIR Instead of peakfile, use this option to define a directory that contains several peakfiles. This will run FIRE using as many categories as the peakfiles. Useful to compare multiple peak lists motif-wise. --fastafile=INT Genome file (wg.fa) - must download it, instructions at INSTALL file, section DATA. --randmode=STR "adjacent" Selects immediately adjacent sequences. "1MM" Shuffles the original peak sequences. "CGI" Keeps the % of peaks that overlap with CpG islands. "random" Random sequences on the same chromosome. Define how to select random sequences. Default is "random". --genome=STR hg19 (human) hg18 (human) mm10 (mouse) mm9 (mouse) rn4 (rat) dm3 (drosophila) sacser (Saccharomyces cerevisiae) --species=STR human mouse rat drosophila yeast Use this option to match the discovered motifs against a database of known motifs. Need to download first the corresponding data from https://tavazoielab.c2b2.columbia.edu/FIRE/. --seed=INT When set to 1, FIRE runs with default seed for random intervals. Useful if you want to keep same random regions for different FIRE runs. Default is 1. --runit=INT When set to 1, FIRE runs directly. If set to 0 you need to run FIRE manually. Default is 1. --k=INT Define the size of the motifs to be discovered (values 3-8). Default is 7. --verbose=INT Verbose mode. Default is 0.
This command will create two files: TF_targets.txt.FIRE.txt and TF_targets.txt.FIRE.seq. By default, ChIPseeqerFIRE runs FIRE for you. However, if you'd like to run FIRE manually, you can use TF_targets.txt.FIRE.txt and TF_targets.txt.FIRE.seq as input (step 4).
IMPORTANT: To install FIRE see https://tavazoielab.c2b2.columbia.edu/FIRE/tutorial.html Don't forget to set the $FIREDIR variable and add it to your $PATH.
4. If you want to run FIRE independently of the ChIPseeqerFIRE script, type the following command:
fire.pl --expfile=TF_targets.txt.FIRE.txt --fastafile_dna=TF_targets.txt.FIRE.seq --nodups=1 --minr=2 --species=human --dorna=0 --dodnarna=0
The -- expfile and --fastafile_dna options must be the .txt and .seq files respectively that have been created by ChIPseeqerFIRE.
The following options are available:
-jn_t=INT between 0 and 10, define the robustness index threshold (default is 6) -k=INT defines the length of the k-mer seeds (default is 7) -expfiles="*.txt" processes all .txt expression files in current directory -sortmotifbyphase=1 Skip module discovery, sort motifs by phase (useful only for certain continuous expression profiles) -dodna=0 Skip the DNA combination phase (useful to redo only parts of the analysis) -dorna=0 Skip the RNA combination phase (useful to redo only parts of the analysis) -dodnarna=0 Skip the DNA/RNA combination phase (useful to redo only parts of the analysis)
IMPORTANT: You must define the --species option in order to get the motif names
4. See the results.
The previous step will create a folder with the "FIRE" extension.
Inside this folder you will find the .summary.pdf file which is the most important part of the FIRE analysis.
5. What can I do next ?
- Retrieve the peaks that have a specific FIRE motif. This is explained in Elementolab/Find_peaks_with_motif
If your peak list is > 100,000 lines
1. Go to ChIPseeqer/tools/FIRE/PROGRAMS Open dataio.c Go to line 253, should be: int mynmax = 100000; Change this number to a larger one, if your list of peaks is > 100000 2. Open mi_find.c Go to lines 466 and 468 Change 100,000 to a larger number For example, count_seq_index = (char*)calloc(250000, sizeof(char)); ... count_seq ++; if (count_seq == 250000) die("pb with count_seq > 250000\n"); 3. Open mi_signif.c Go to lines 301 and 304 Change 100,000 to a larger number For example, count_seq_index = (char*)calloc(250000, sizeof(char)); ... count_seq ++; if (count_seq == 250000) die("pb with count_seq > 250000\n"); 4. Go back to ChIPseeqer/src Run "make tools"