Elementolab/Mappability

From Icbwiki

Jump to: navigation, search

This is a step by step guide on computing genome mappability.

  • Step 1: k-mer counting using GenomeTools from Stefan Kurtz and colleagues:

a. Download the GenomeTools from http://genometools.org/

b. Compile the genome tools

make cairo=no 64bit=yes
make prefix=$HOME/usr cairo=no 64bit=yes install # not always necessary, the exec are in bins/

c. Run:

gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg18/wg.fa -indexname reads
gt tallymer mkindex -mersize 30 -minocc 1 -indexname tyr-reads -counts -pl -esa reads -scan
gt tallymer search -output qseqnum qpos counts sequence -tyr tyr-reads -q hg18/wg.fa > readmap.txt

The full manual for the tallymer software can be found at http://www.zbh.uni-hamburg.de/tallymer/tallymer.pdf

Alternatively, ChIPseeqer comes with a script to automate the creation of readmap.txt:

perl $CHIPSEEQERDIR/SCRIPTS/PBS_runGT_makeReadMap.pl --genome=hg18/wg.fa --submit=1
  • Step 2: Run MakeMappabilityStats

Calculate chromosome lengths using fasta_lengths.pl (important)

$CHIPSEEQERDIR/SCRIPTS/fasta_lengths.pl hg18.fa > chr_lengths.txt 

Next step, is to calculate the genomic coordinates of gaps in the genome:

for f in chr*.fa; do perl $CHIPSEEQERDIR/SCRIPTS/quick_find_N.pl $f ; done > chr_regions_with_Ns.txt

MakeMappabilityStats is a perl script under ChIPseeqer directory.

$CHIPSEEQERDIR/MakeMappabilityStats chr_lengths.txt chr_regions_with_Ns.txt readmap.txt > map_stats.txt

The input files are:

chr_lengths: a two-column file that contains chr name and chr len.
chr_regions_with_Ns: an interval file that contain the coordinate of each stretch of Ns in the genome .... (3 cols: chr, start, end)
readmap: the file produced from Step 1.

The map_stats.txt file will have 3 cols: chr, chr_length, chr_mappability

Rename the file to species.chrdata and you are ready!

Personal tools