Elementolab/Mappability
From Icbwiki
This is a step by step guide on computing genome mappability.
- Step 1: k-mer counting using GenomeTools from Stefan Kurtz and colleagues:
a. Download the GenomeTools from http://genometools.org/
b. Compile the genome tools
make cairo=no 64bit=yes make prefix=$HOME/usr cairo=no 64bit=yes install # not always necessary, the exec are in bins/
c. Run:
gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg18/wg.fa -indexname reads gt tallymer mkindex -mersize 30 -minocc 1 -indexname tyr-reads -counts -pl -esa reads -scan gt tallymer search -output qseqnum qpos counts sequence -tyr tyr-reads -q hg18/wg.fa > readmap.txt
The full manual for the tallymer software can be found at http://www.zbh.uni-hamburg.de/tallymer/tallymer.pdf
Alternatively, ChIPseeqer comes with a script to automate the creation of readmap.txt:
perl $CHIPSEEQERDIR/SCRIPTS/PBS_runGT_makeReadMap.pl --genome=hg18/wg.fa --submit=1
- Step 2: Run MakeMappabilityStats
Calculate chromosome lengths using fasta_lengths.pl (important)
$CHIPSEEQERDIR/SCRIPTS/fasta_lengths.pl hg18.fa > chr_lengths.txt
Next step, is to calculate the genomic coordinates of gaps in the genome:
for f in chr*.fa; do perl $CHIPSEEQERDIR/SCRIPTS/quick_find_N.pl $f ; done > chr_regions_with_Ns.txt
MakeMappabilityStats is a perl script under ChIPseeqer directory.
$CHIPSEEQERDIR/MakeMappabilityStats chr_lengths.txt chr_regions_with_Ns.txt readmap.txt > map_stats.txt
The input files are:
chr_lengths: a two-column file that contains chr name and chr len. chr_regions_with_Ns: an interval file that contain the coordinate of each stretch of Ns in the genome .... (3 cols: chr, start, end) readmap: the file produced from Step 1.
The map_stats.txt file will have 3 cols: chr, chr_length, chr_mappability
Rename the file to species.chrdata and you are ready!