GeneTools

From Icbwiki

Jump to: navigation, search

Contents

geneTools

GeneTools is developed by Dr. David Soong at the [Elemento lab]. Please contact us at genetools.cornell@gmail.com if you have questions, suggestions, or would like to use our software.

IMPORTANT: If you need a copy of the software, please use "GeneTools software request" in the subject line when emailing us. Thank you.

Related pages

NGS Pipeline

This program generates Sun Grid Engine scripts for pre-processing reads, performing QC analysis, doing sequence alignment (TopHat, BWA), visualization, etc.

scripts/pipelineSeq.pl

Example 1

Assuming raw sequences were generated by Illumina Casava 1.8 and are stored in those big TAR files. After untarring them the reads are stored in Sample_name_R1_*.fastq.gz in data/Sample_*/ subdirectories. The following commands will:

  • Filter the raw reads by Illumina pass-filter tags and base quality.
  • Run BWA to align all these split small FASTQ files in parallel.
  • Convert the output SAM file to BAM format.
  • Split the SAM alignments by chromosome for SNVseeqer/CHIPseeqer.
  • Generate raw/unique read frequency files.
  • Generate wiggle tracks for visualization on the UCSC genome browser (will be uploaded to tmp_web_output/ on a local server).
# Generate lists containing files to process
for i in `ls -d data/Sample_*/` ; do
       outfile=${i}/seq_files.list
       ls $i/*.fastq.gz > $outfile
done

# Generate and submit scripts 
for i in `ls -d data/Sample_*`; do
       samplename=`basename $i`
       ./pipelineSeq.pl -list data/${samplename}/seq_files.list -mem 3g -time 5:00:00 -type chip -webdir tmp_web_output \
                       --filter -out alignments_chip/$samplename --casava18 -tag $samplename
       # submit Sun Grid Engine jobs 
       qsub qALN_CHIP_${samplename}.sh
       qsub -h qALN_CHIP_${samplename}_post.sh
       qalter -hold_jid qALN_CHIP_${samplename}.sh qALN_CHIP_${samplename}_post.sh
       qalter -h U qALN_CHIP_${samplename}_post.sh
done

Example 2

The following command will generate an SGE script (qALN_XXX_YYY.sh) that aligns the single-end FASTQ file against the hg19 genome with TopHat (using 4 CPUs), generates a BAM file, removes duplicates, generates read density tracks (raw + unique), and calculates the RPKM values.

./pipelineSeq.pl -i Sample1.fastq.gz -o output/ --rpkm -cpu 4 -sge _XXX_ -tag YYY -ref hg19 --nouploadwig

Processing FASTQ files

Filter and trim single-end reads

The following command keeps the nucleotides in positions 0-24 (total length of 25 bp) and filters out trimmed reads that have low quality (average score < 30.0).

./geneModel seq -cmd trimse -f data/sample_1_sequence.txt.gz -s 0 -len 25 -minscore 30.0 -o output/trimmed_sample_1_sequence.txt.gz
Options:
               -f               FASTQ file
               -o               Output FASTQ file
               -start           Starting read position (0-based)
               -len             Output read length
               -minscore        Minimum average quality score in each trimmed read (default: 30.0)
               --casava18       Indicate that the file is in CASAVA 1.8 format
               --PF             If input file is in CASAVA 1.8 format, further filter based on the Pass Filter tag

Filter and trim paired-end reads

The following command keeps the nucleotides in positions 0-24 (total length of 25 bp) and filters out trimmed reads that have low quality (either end with average score < 30.0). The read identifiers are output in the same order in the resulting RD1 and RD2 files.

./geneModel seq -cmd trimpe -f1 data/sample_1_sequence.txt.gz -f2 data/sample_2_sequence.txt.gz -s 0 -len 25 -minscore 30.0 
                            -o1 output/trimmed_sample_1_sequence.txt.gz -o2 output/trimmed_sample_2_sequence.txt.gz
Options:
               -f1              FASTQ file 1
               -f2              FASTQ file 2
               -o1              Output FASTQ file 1
               -o2              Output FASTQ file 2
               -start           Starting read position (0-based)
               -len             Output read length
               -minscore        Minimum average quality score in each trimmed read (default: 30.0)
               --casava18       Indicate that the files are in CASAVA 1.8 format
               --PF             If input files are in CASAVA 1.8 format, further filter based on the Pass Filter tag

Display numerical Phred score values

The following command converts the ASCII Phred codes to their numerical values.

./geneModel phred -seq 'be^eb]^^Ueef^fZabSbaag[QgdadbffdbG]YIXO[`]edb_f_T^Xe_ecbfeebcf\ffcfdff_BBBB'

Estimate insert size from paired-end reads

scripts/getInsertSize.pl estimates the average insert size from paired-end reads (FASTQ format). Insert size is defined as total selected fragment size - 2 * read length unless --total is specified.

Options:
       -i       FASTQ file for RD1
       -j       FASTQ file for RD2
       -o       Output file containing all insert sizes for all reads (optional)
       -log     Output file containing mean and median insert sizes
       -ref     Reference genome sequences
       --total  Report as insert size the average total length of the selected fragments (i.e. between the start
                coordinate of the left read in the read-pair and the end coordinate of the right read in the read-pair)
       -help

NOTE:

  • To run this program you need the bowtie aligner, a reference human genome, and its bowtie indices.

Alignment/SAM related

Convert RefSeq-based alignments to genome-based alignments

./geneModel align -cmd ref2g -i bwa_ref.sam -o bwa_genome.sam
Options:
                -i    input sam file
                -o    output sam file
                -ref  RefSeq data file

scripts/quickSAM2BAM.pl bwa_genome.sam outfile                     # convert SAM to BAM (binary format)
scripts/makeBAMtracks.pl -i outfile.srt.bam --nodry                # upload to genome browser

NOTE: make sure you use the RefSeq definition file that matches the RefSeq FASTA file you aligned to.

Remove duplicate reads from single-end alignments

Very often PCR amplification can generate an excessive amount of the same transcript. This interferes with the calculation of gene expression as well as the identification of mutations, so it is best to remove duplicates before performing such analyses. To remove duplicates for single-end reads, we retain the best N reads (according to average base quality scores) for each 5' alignment start position/strand. The latest implementation is much more memory efficient. It should take less than 5GB of RAM for 100 million unique alignments (Total=~ 125MB + 45MB per million unique alignments).

./geneModel rmdup2 -i accepted_hits.sam.gz -n 4 -o unique.sam            # outputs an uncompressed sam file
 or 
./geneModel rmdup2 -i accepted_hits.sam.gz -n 4 -o unique.sam.gz         # directly outputs a gzipped sam file

Remove duplicate reads from paired-end alignments

The following command extracts reads that have both ends aligned to the reference genome and outputs all unaligned reads to another file (allreadsFile) if specified. Clonal read detection is based on the mapping coordinates in both RD1 and RD2; among the duplicate reads, only the one with the highest average Phred quality score is kept. Reads that aligned to multiple locations are discarded.

./geneModel rmdup -cmd paired -f1 $sam1 -f2 $sam2 -o1 $sam_unq1 -o2 $sam_unq2 -unaln ${allreadsFile} > $logfile

Note: this program detects both intra-chromosomal and inter-chromosomal duplicates.

Get read counts for a list of positions (or SNVs) from a sorted BAM file

./geneModel readcnt -cmd bam -snv snv_list.txt -bam alignment_hits.srt.bam [--uniq] -o SNVs_cnt.txt
options:
                -snv    file containing a list of positions
                -bam    alignment file in BAM format
                -o      output file 
                --uniq  extract unique read counts (default= false)

snv list file format:
chr10   100479127
chr10   100479231
chr10   100479241
chr10   100480909
chr10   100495096

output file format: read count in the last column
chr10   100479127       2
chr10   100479231       3 
chr10   100479241       4
chr10   100480909       4 
chr10   100495096       3

Summarize gene expression levels from split SAM files (aligned against the hg18 genome)

./geneModel calcexp -sam split_dir/ -o output_file.exp [--unique]
options:
                -sam                 directory 
                -o                   output expression summary file
                --uniq               report gene expression levels based on unique reads       

output format: 
chr   gene name       transcript      utr5_cnt        utr5_len        coding_cnt      coding_len      utr3_cnt        utr3_len        total_cnt       total_len

Extract unaligned read sequences

./geneModel ali -cmd unaln -aln output/accepted_hits.sam.gz -read data/s_1_sequence.txt  > output/unaligned.fasta
options:
               -aln             alignment file
               -read            input fastq read sequence file
               --fasta          output unaligned reads in fasta format (default)
               --fastq          output unaligned reads in fastq format

Generate alignment statistics (e.g. # of clonal reads)

./geneModel sam -cmd sum -i alignments/rna001.sam.gz -o output/rna001_aln_summary.txt

Get read length from a SAM file

./geneModel align -cmd readlen -i sequences.fa

Convert from SAM format to ELAND

./geneModel sam2eland -i input -o output
Options:
               -input           Input SAM file
               -output          Output ELAND file (default= STDOUT)

Split a SAM file by chromosome

Split the alignments by chromosome for SNVseeqer or CHIPseeqer processing. This program can directly split compressed SAM files (*.sam.gz). The split files will be reads.chr* in the output directory.

./geneModel ali -cmd split -i my_sample.sam.gz -out split/ [ --fast ]

Options:
               -i               Input SAM file
               -out             Output directory containing the split SAM files
               --fast           Use this option if the input SAM file has not been coordinate sorted.

BAM related

Get read length from a BAM file

This program returns the length of the first encountered read in a BAM file.

./geneModel bam -cmd readlen         
Options: 
               -i               Input BAM file
               -o               Output file containing the read length

Calculate sequencing depth across the transcriptome

./geneModel bam -cmd depth -exon data/refGeneExons.txt -i alignments/sample.bam -o sample_seq_depth.txt
Options:
               -i               Input BAM file
               -exon            Exon regions (ChIP-seq format)
               -o               Output summary file

NOTE: the exon regions can be generated following the instruction here

Generate a read frequency Wiggle track from a BAM file

./geneModel bam -cmd wig -i output/accepted_hits.srt.bam -o output/read_cnt.wig [--uniq]
Options:
                -i     input BAM file
                -o     output Wiggle file
                --uniq when specified removes the clonal reads when calculating the seq. depth
                -hg    specify what human genome release to use (hg18 or hg19)

Calculate RPKM expression values from a sorted BAM file

./geneModel calcexp -cmd bamrpkm -i alignments/rna001.srt.bam [--uniq] -o output_2011/rna001_rpkm.txt
Options:
               -i               input BAM file
               -ref             RefSeq gene definition file
               --uniq           use unique reads only (default= no)
               -out             output RPKM file
               -hg              specify what human genome relese to use (hg18 or hg19)

Convert your Wiggle file to BigWig format and upload it to the Web server

scripts/makeBigWig.pl -i output/read_cnt.wig -webdir relative_dir_on_your_server/ -title "Seq. depth" -desc "Read counts" --nodry

SNV related

Detect SNVs from a BAM file

  • Call the SNVs from a sorted BAM file. This will also generate an SNV output file in SNVseeqer format (output_SNVsqr.txt). Our method of SNV detection is described in our PLoS One paper
./geneModel bam -cmd snv -i alignment/accepted_hits.bam -f ~/databases/hg18.fa --uniq -o SNVs.txt
Options:
          -cmd snv              Call SNVs from a BAM file. Here we assume a Poisson-Binomial model with
                                Benjamini Hochberg correction for multiple testing.
               -i               Input BAM file
               -o               Output SNV file (will also generate output_SNVsqr.txt in SNVseeqer format)
               -f               Reference genome (FASTA format)
               -min             Mininum number of supporting reads
               -pval            Threshold for adjust p-value (default: 0.01)
               -test            Type of statistical test for detecting SNVs:
                                   1. Binomial test
                                   2. Poisson-Binomial test
                                   3. Perform both tests
               -hg              Specify the human genome release to use (hg18 or hg19)
               -cpu             Number of CPUs to use (default= 1)
               --tmp            Keep temporary files
               --uniq           Use unique alignments only (i.e. remove clonal reads)
               -reg             If specified, extract information and call SNVs for this region

Output file format:

chr1    168122        C       2684            144          143     1/0/141/1       7.94002e-286    8.342e-205      1.24322e-281
  1. chromosome
  2. position (0-based)
  3. Reference nucleotide
  4. Total number of reads
  5. Total number of unique reads
  6. Number of mutated reads
  7. Number of mutated reads on the plus strand
  8. Allele frequency for mutations (A/C/G/T)
  9. p-value by Binomial test
  10. p-value by Poisson-Binomial test
  11. Adjusted p-value (by default based on Poisson-Binomial)

NEW: bam-snv now supports multi-threading (multiple CPUs) and has a small memory footprint (e.g. <2GB total RAM using 5 threads for Exon-seq files from Illumina HiSeq2000).

Look up the base quality for a set of SNVs from a BAM file

./geneModel bam -cmd base-quality [options]
Options:
               -i               Input file containing the genomic coordinates
               -bam             BAM file
               -f               Reference genome in samtools-indexed FASTA format
               -o               Output file
               --uniq           Remove PCR duplicates
               --keep           Keep intermediate file

UCSC browser related

Rescale a Wiggle track by total number of bp aligned

./geneModel util -cmd rescale-wig -i density_track.wig -o density_track_norm.wig
Options:
                -i              input Wiggle track
                -o              output Wiggle track
                -scale          further multiply by this value (default: 1E9)

Normalize a Wiggle/BedGraph track so mean=0 and std=1

./geneModel util -cmd normalize-wig -i density_track.wig -o density_track_norm.wig

Generate a Wiggle/BedGraph track of the difference between two Wiggle/BedGraph files

./geneModel util -cmd diff-wig -f1 density_file1.wig -f2 density_file2.wig > output_density_diff_1_2.wig

Generate a Wiggle track of the RPKM gene expression values

The following command uses column 3 of the input file as RPKM values and produces a UCSC track, assuming column 0 contains the gene symbols. A value is provided for every 1000bp (non-overlapping) genomic window. If --smooth is specified, each window's reported value is the average raw value across 9 neighboring windows.

./geneModel dna -cmd rpkm-density -res 1000 [--smooth]  -i data/rpkm_samples.txt -cidx 3 -o output/rpkm_sample2.wig

NOTE: this program can also be used to generate other types of tracks (e.g. copy number) as long as a value is provided for each gene.

Calculate gene density and generate a UCSC Wiggle track

./geneModel dna -cmd density -res 1000 [--smooth] -o output/gene_density_1kb_norm.wig 
Options:
               -res             Resolution (size of a block)
               -o               Output wiggle file name
               --smooth         If specified, use the average of the neighboring 8 blocks (4 on each side)

Sequence related

Retrieve genes at the specified chromosomal position and show corresponding sequence products (mRNA, protein, exon id)

./geneModel dna2ref -chr chr6 -pos 170730963

locating chr6:170730963
loading databases/ucsc/refGene.txt.25Nov2009
Total: 31803 genes
gene(s) found: NM_002598
NM_002598: strand= - start= 170728374 end= 170735673
name: PDCD2
mrna seq:
TCTTGCCTTCCGGCCCGGCGCCCGATTTCCGCCTTCCGACCCAGCTGTGGGCTGCGCCC[...]GAATTAAATAAATTTTGGTACATCCACA
coding seq:
ATGGCTGCCGCCGGGGCCAGGCCTGTGGAGCTGGGCTTCGCCGAGTCGGCGCCGGCGTGGC[...]TAACAGATACACCGTAA
coding length= 1035
protein seq:
MAAAGARPVELGFAESAPAWRLRSEQFPSKVGGRPAWLGAAGLPGPQALACELCGRPLSF[...]SCSLGTGYTEEFVWKQDVTDTP* 
length= 345
chr6:170730963 (0-based) refbase= T exon id: -1 runleft= -1
specified position not in exon or it's in a non-coding gene (exonid_cd == -1)

Transcribe a gene to mRNA from the genomic DNA sequence (hg18) based on refGene exon coordinates

./geneModel dna -cmd tr -gene NM_002598  --rna
loading SNPseeqer/REFDATA/refGene.txt.25Nov2009
>NM_002598
TCTTGCCTTCCGGCCCGGCGCCCGA[...]GACAGATTGCTTACACAGTGTCTACTTATTAGTGAAACAAAAGTGTCCAGTGACAGGGAATTAAATAAATTTTGGTACATCCACA

Translate a gene to its protein sequence based on refGene exon coordinates

./geneModel dna -cmd tr -gene NM_002598 
loading SNPseeqer/REFDATA/refGene.txt.25Nov2009
>NP_002589 NM_002598
MAAAGARPVELGFAESAPAWRLRSEQ[...]EKDIPDCPCGAKRILEFQVMPQLLNYLKADRLGKSIDWGILAVFTCAESCSLGTGYTEEFVWKQDVTDTP*

Get a random subset of paired-end reads from input FASTQ files

./geneModel rndpairseq [options]
               -f1              Read 1 FASTQ file
               -f2              Read 2 FASTQ file
               -o1              Output file for read 1 sequences
               -o2              Output file for read 2 sequences
               -n               Number of sequences to sample
               -top             If specified, only sample from the first (top) lines of the input files.
                                If (top) is 0, then there is no sampling and we simply output the first (n) sequences.
                                By default, we sample the sequences from the entire input files

Get all sequence lengths from a FASTA file

./geneModel dna -cmd len -i sequences.fa

DNA sequence [reverse] and/or [complement]

./geneModel dna -seq ctggtctcgaactcctgagctcaag
input:      ctggtctcgaactcctgagctcaag
complement: GACCAGAGCTTGAGGACTCGAGTTC
reverse:    gaactcgagtcctcaagctctggtc
rev. compl: CTTGAGCTCAGGAGTTCGAGACCAG
length:     25

Retrieve a genomic region from hg18 genome

  • NOTE: This function requires an installation of the reference FASTA file and proper indexing of the file with SAMTOOLS.
./geneModel <command> [options]
               -ref             Reference hg18 fasta and SAM-indexed files (e.g. $HG18DIR/hg18.fa)
               -region          Region to retrieve for (e.g. chr1:100-200, 1-based coordinates)
               -len             Length of sequence to retrieve if only the starting position is specified (e.g. -reg chr1:100 -len 20)
               --upper          Show sequence in UPPER CASE

Example:

./geneModel hg18 -reg chr5:135,540,630-135,540,639 --upper
TACTGTACTG
length= 10

Convert a FASTQ file to FASTA format (after quality filtering)

./geneModel seq -cmd conv2fasta -i fastq_sequence.txt.gz -o output_fasta.txt [-minq 15.0]
options:
                -i    input FASTQ file (txt or compressed)
                -o    output FASTA file (STDOUT if not specified)
                -minq minimum average Phred score for quality filtering

example output:
total: 18809829 reads
passed: 18319274 reads 
quality filter: 15

Gene related

Extracting all exonic regions from the genome

./geneModel dna -cmd exons -ref refGene.txt -o refGene_exons.txt
Options:
               -ref             RefSeq definition file
               -o               Output file
               --merge          Merge overlapping exons
               --full           Output transcript name + exon id

NOTE: The (non-overlapping) exon regions will be in ChIP-seq format.

Generate gene coordinates in interval format

This function can be used to generate gene body regions, promotor regions (e.g. TSS +/- 2kb), gene ends, or gene neighborhoods (e.g. TSS- 10kb ~ TES+10kb).

./gmm dna -cmd ranges -o data/all_gene_ranges_50kb.txt
Options:
               -i               Input file containing a list of genes
               -ref             RefSeq definition file
               -ext             Extend genes by n bp up of TSS and downstream of TES
               -hg              Specify genome (e.g. hg18, hg19)
               --TSS            Output coordinates around the TSS only. (--TSS trumps --TES)
               --TES            Output coordinates around the TES only
               --pair           Output in paired-block format
               --strand         Output additional column containing the strand
               -o               Output file

Get gene transcript structure

./geneModel dna -cmd struct -name NM_001134738

output:
NM_001134738: BCL6
chr3
strand: -
tx start: 188921858
cd start: 188922939
Exon 0: 188921858 - 188923083
Exon 1: 188925422 - 188925560
...
Exon 6: 188934014 - 188934185
Exon 7: 188935350 - 188935389
cd end: 188934175
tx end: 188935389

Convert gene identifiers

E.g. The following specifies that column 0 of the input file contains RefSeq transcript identifiers and adds a column of corresponding gene symbols.

./geneModel util -cmd convert-gene-id -i input_file.txt -cidx 0 --transcript > output_file.txt

Convert gene identifiers (advanced)

You can also convert identifiers in more than one column and between your choice of identifiers (UCSC, RefSeq, gene symbol). The following example converts the UCSC gene ids in columns 4 and 6 to gene symbols. This is done by fist using the convert-gene-id command on column 4 and feeding the intermediate output (through the pipe "|") to another convert-gene-id command which converts column 6.

There are seven types of conversion you can do:

  • 0: RefSeq -> Gene
  • 1: UCSC -> Gene
  • 2: Gene -> RefSeq
  • 3: Gene -> UCSC
  • 4: Gene -> Chr
  • 5: RefSeq -> Chr
  • 6: Gene -> ChIP-seq style interval
  • 7: RefSeq -> ChIP-seq style interval

E.g.

./geneModel util -cmd convert-gene-id -i input_file.txt -cid 4 -type 1 --uniq | ./geneModel util -cmd convert-gene-id -cid 6 -type 1 --uniq > final_output.txt

Here is a complete list of options:

                -i              Input file
                -cid            Column index of the query identifiers
                -sep            Field separator
                -type           Type of conversion to perform:
                                   0: RefSeq -> Gene
                                   1: UCSC -> Gene
                                   2: Gene -> RefSeq
                                   3: Gene -> UCSC
                                   4: Gene -> Chr
                                   5: RefSeq -> Chr
                                   6: Gene -> ChIP-seq style interval
                                   7: RefSeq -> ChIP-seq style interval
                -chip           ChIP-seq style interval file; required if type is 6 or 7
                --range         If type is 4 or 5 (above), then also output the start and end position of the gene
                --uniq
                --trans         Identifier is RefSeq transcript id (kpet for backward compability)

Process the duplicate records in the RefSeq definition file

There are occasionally RefSeq transcripts with identical ids assigned to disparate locations in the genome. We may handle such cases by regenerating a RefSeq definition file with unique identifiers along with their corresponding FASTA sequences. By default (without the --first option) a numerical suffix will be added to each duplicate transcript id so each will later generate its own sequence. On the other hand, the --first option, when specified, will keep each of the first encountered transcript records and ignore all other duplicate ids. Note that transcripts on haplotype and random chromosomes are filtered out.

./geneModel dna -cmd uniq-refseq -ref RefGene.txt -o RefGene_uniq.txt [--first]
               -ref             RefSeq definition file
               -o               Output unique definition file
               --first          If specified, only output the first qualifying record for duplicate transcript ids

NOTE: To generate the reference FASTA sequences from the unique RefSeq definition file, please use rebuildRefSeqFromRefGene following instructions in SNVseeqer database tools

Create data file for PAGE analysis

PAGE (Pathway Analysis of Gene Expression) is a program that identifies the enrichment of Gene Ontology or KEGG pathways from a list of user provided genes. You can easily create the PAGE input file from your list of genes using the following command. Read more about PAGE here.

./geneModel page -i input_genes.txt -o page_file.txt

NOTE: input_genes.txt contains genes that you believe are associated through some up-stream analysis (e.g. microarray co-expression analysis). It contains only one column, e.g.

ACTA2
ACTL7A
ACTL7B
ACTR1A 
...

ChIP-seq peaks related

Find overlap between two sets of peaks

./geneModel peakoverlap [options]
options:
               -i               peak file 1
               -j               peak file 2; will be used as reference in the output file
               -o               Output peak file
               -l               Log file

               -exti            Extend peaks in file 1 by exti bp on each side
               -extj            Extend peaks in file 2 by exti bp on each side

               --new            Only output the overlapping peak regions (i.e. without regions in peak file 2)
               --ref            If both --new and --ref are specified, output the reference peaks in file 2
                                (instead of the overlapping regions) that overlap with any peak in file 1
               --diff           If specified, show only the file2 peaks that do not overlap with any peaks in file1

Merge multiple sets of peaks

This program merges multiple sets of overlapping peaks (within and across input files) and generates a set of non-overlapping peaks.

./geneModel chip -cmd union -f file1.txt -f file2.txt -f file3.txt [-f ....] -o outfile.txt
options:
                 -f             ChIP-seq file. Multiple files can be specified using multiple -f arguments, e.g.
                                -cmd union -f file1.txt -f file2.txt -f file3.txt
                 -o             Output ChIP-seq file with the merged peaks

Example peaks:

File A:
        <---------->       <--------->
File B:
              <----->  <------>
File C:
                 <------>               <-------->
Output:
        <---------------------------->  <-------->

Annotate ChIP-seq peaks with gene information

./geneModel chip -cmd annotate-peaks [--gene] [--original] -i chip_seq_file.txt -o chip_seq_gene.txt
Options:
    --gene       if specified, output in gene centric view; shows for each gene the number of peaks in the promoter, gene body, and distal region.
    --original   show original peaks (and annotations)

Annotate one set of intervals with a reference set of intervals

This program annotates one set of intervals with a reference set of intervals. For example, for each query TF binding site, we can find out which cytoband it belongs to and then summarize/filter by the number of TF binding sites each cytoband contains.

E.g. the following command finds cytobands that have at least 5 BCL6 binding sites.

./geneModel util -cmd map-intervals -i data/BCL6_targets.txt -chip data/cytoband_positions.txt --filter -min 5

Usage: ./geneModel chip -cmd map-intervals [options]

Options:

                -i              File containing intervals to be annotated
                -chip           Reference set of ChIP-seq style intervals
                -out            Output file containing the annotations
                --filter        Filter based on the number of query intervals in each reference interval
                -min            Minimum number of query intervals in each reference interval
                -offset         If specified, the right end of each interval will be offset by this
                                value (default=1). This option is used to deal with cases where the
                                right side of the interval is 1-based.

Get distances to TSSs or TESs

./geneModel chip -cmd dist2tss [options]
Options:
                -i              input ChIP-seq file
                -maxdist        max. distance to a TSS/TES allowed
                -cutoff         how far should we look downstream of a TSS or upstream of a TES?
                --tes           use TESs instead of TSSs
                --orig          include original data in the output file

Calculate the density of ChIP-seq binding sites around specified positions

./geneModel chip -cmd cal-density [options]
                -i              Query file containing the query positions in ChIP-seq format (use - to indicate STDIN).
                -chip           Reference ChIP-seq file.
                -o              Output file.
                -max            Maximum distance to each query position. We search for ChIP-seq peaks in this neighborhood.
                -sep            Field separator used when parsing the query file.
                --orig          Show original columns in input file. If not specified, show only the peak intervals and density.

Generate a binary wiggle file at the specified resolution for fast lookup

 ./geneModel chip -cmd bin-wig [options]
                -i              Input Wiggle track
                -o              Output binary Wiggle file
                -res            Resolution (bin size)
                --norm          Normalize by resolution (i.e. calculate average read count instead of total read count)
                --peak          Treat each record as a peak (i.e. ignore column 4 and add count 1)

Print the content of a binary wiggle file

 ./geneModel chip -cmd print-bin-wig [options]
                -i              Binary wiggle file
                --interval      Print in ChIP-seq interval format (by default is off, i.e. shows in the internal format)
                --avg           Show average read count instead of total read count (i.e. normalize by bin size)
                --smooth
                -k
                --nonzero       If specified, print only the non-zero entries

Extend the peak intervals by n-bp on each side

 ./geneModel chip -cmd extendpeaks [options]
                 -i             Input ChIP-seq style interval file
                 -n             Number of basepairs to extend on each side of a peak
                 -o             Output file of extended peaks

Map frequency data to ChIP-seq intervals

 ./geneModel -cmd wiggle2chip [options]
                 -i             Wiggle file. The file can be in text (.wig) or binary (.wigbin) format.
                 -chip          ChIP-seq style interval file (query)
                 -o             Output file
                 -res           Resolution at which the frequency data is loaded into memory.
                 --peak         Treat each line in the input wiggle file as a binary peak. I.e. this ignores the peak height and width
                 -scale         Rescale the output frequency
                                0: do not rescale the frequency (default)
                                1: rescale the frequency by the size of the query interval
                                2: rescale the frequency by the number of bins that overlap with the query interval
                                This is used to find the average frequency from bins that overlap with the query interval.
                 --normbin      Normalize the output frequency by bin size.
                 --interval     If specified, output only the interval followed by the frequency. 
                                Everything else in the original ChIP-seq file is ignored in the output.

Generate files for differential FIRE analysis

 ./geneModel -cmd dfire [options]          
                 -f             Files (use -f file1 -f file2 -f file3 ...)
                 -t             Tags (use -t tag1 -t tag2 -t tag3 ...)
                 -ref           Reference human genome (samtools-indexed)
                 -o             Output file base (will append .FIRE.txt and .FIRE.seq)

Network related analyses

Output the degrees and annotations of genes in the network

./geneModel network -cmd degree -i network.sif -o output.txt
Options:
               -i               input SIF file
               -o               output file

Extract the N most connected genes (hubs) and perform PAGE pathway enrichment analysis

./geneModel network -cmd page -i network.sif -o output.txt -n topN
Options:
               -i               input SIF file
               -o               output PAGE file
               -n               top N hubs

Find connected sub-networks and output their component genes

./geneModel network -cmd subnet -i network.sif -o output.txt -n topN [--split]
Options:
               -i               input SIF file
               --split          if a node contains multiple gene names (e.g. "A1CF|GFER"), split the node according to the separator
               -sep             separator (default="|")
               -o               output file


Other functions

Generate random mappable regions

This generates n random regions with mappability score >= min based on a mappability file (interval/bedGraph format). Random sampling can be done in the entire genome or only in the specified chromosomes (-chr option).

 ./geneModel util -cmd gen-map

                -i              Mappability track
                -n              Generate <n> samples
                -res            Load mappability information at <res> resolution
                -min            Minimum mappability score required
                -chr            Chromosomes from which we sample [default: all chromosomes]
                --norm          Mappability track contains normalized scores
                --block         Output randomly selected regions as <res> bp blocks

E.g. This command generates 500 10kb regions with mappability > 0.9. The mappability file is in bed format and contains the mappability score for all 10kb windows in the genome.

./geneModel util -cmd gen-map -i data/mappability_50bp_chr_10K.txt -res 10000 -n 500 --norm -min 0.9

Perform Fisher's exact tests

This command performs Fisher's exact test on each line of the input file using data in the specified columns (-fields option). The four columns' values (a,b,c,d) are defined following the convention below (group x observation). Three columns will be appended to each line showing p-value(two.sides), p-value(greater) and p-value(less). Note that here we only deal with 2x2 contingency tables.

     | G1  |  G2
-----+-----+------
obs1 |  a  |   b
-----+-----+------
obs2 |  c  |   d
./geneModel util -cmd fisher -i my_counts.txt -fields 1,2,3,4 -o outfile_pval.txt

Perform Benjamini–Hochberg correction for multiple hypothesis testing

This command performs Benjamini–Hochberg correction on the p-values in the specified column of the input file.

./geneModel util -cmd bh-correction -i infile_pval.txt -col 7 -o outfile_padj.txt
Additional options:
                -pval           output only records with adjusted p-values less than this threadshold
                --small         use small memory algorithm

Calculate the histogram based on the input column(s) of a file

E.g. Use data in columns 7 and 8 of input_file.txt.gz and calculate the histogram with min=0, max=15000, and bin size=25.
./geneModel util -cmd cal-histogram -i input_file.txt.gz -min 0 -max 15000 -bin 25 -cidx 7,8 -o output_histogram.txt

Merge rows of the same key and output the averaged values per specified column per key

E.g. Use column 6 (gene symbol) of input_file.txt as key and output averaged values for columns 1,2,3,4. 

./geneModel util -cmd merge-row-val -i input_file.txt -kidx 6 -vidx 1,2,3,4 > merged.txt

input_file.txt
NM_001042697    28.1     38.9     17.6     10.2    9.7     ZSWIM7
NM_001042698    28.9     39.2     18.6     11.3    10.9    ZSWIM7

merged.txt
ZSWIM7  28.5     39.05    18.1     10.75

Find the intersection between two lists of items

./geneModel util -cmd list-overlap -f1 list1.txt -f2 list2.txt [--print]
options:
use -idx1 and -idx2 to specify which columns in the files the items are in (default=0).

Find the difference between two lists of items (set2 - set1)

./geneModel util -cmd list-overlap -f1 list1.txt -f2 list2.txt --diff [--print]
options:
use -idx1 and -idx2 to specify which columns in the file the items are in (default=0).

Filtering: compare column(s) of a file to a reference list

Print records whose specified columns contain any of the items provided in the master list/reference file.

./geneModel util -cmd filter-from-list
Options:
                -i              Input file
                -tgidx          Column indices in the input (target) file (comma-separated)
                -list           Master list/reference file
                -lidx           Column index in the list file
                -match          Tag to show when the query columns are in the list file. if not specified, the matching key will be shown
                --showall       If specified, show all lines whether or not the query columns match the list file

Find the overlap between two files based on multiple columns

This can be used when the key (e.g. chromosomal region) is split into several fields (e.g. chr, start pos, end pos). Lines in file2 with keys that already appeared in file1 will be output.

E.g. the following command prints out the lines in file2.txt whose columns 0,1,2 appear in file1.txt. 
./geneModel util -cmd file-overlap -f1 file1.txt -f2 file2.txt -idx1 0,1,2 -idx2 0,1,2 > output_file

Find the difference between two files based on multiple columns

This can be used when the key (e.g. chromosomal region) is split into several fields (e.g. chr, start pos, end pos). Lines in file2 with keys not seen in file1 will be output.

E.g. the following command prints out the lines in file2.txt whose columns 0,1,2 are not in file1.txt. 
./geneModel util -cmd file-difference -f1 file1.txt -f2 file2.txt -idx1 0,1,2 -idx2 0,1,2 > output_file

Add columns from a reference file to a query input file

./geneModel util -cmd add-columns -i input_file 
          -cmd add-columns      Add columns from a reference data file to the input file after matching gene names
                -i              Input file
                -dat            Reference data file.
                -key            Column index of the key in the input file
                -ref            Column index of the key in the reference data file
                -val            Column indices of the values to extract
                -na             Tag for genes without matching data in the reference file

Extract columns from multiple files and merge by key (row)

E.g. Here we have three files containing gene expression values from four microarray (exp_array.txt) and RNA-seq experiments (rna001_rpkm.txt and rna002_rpkm.txt). We want to merge these files and output all four expression values for every gene. Let's say the input files have the following format:

exp_array.txt:

(format: gene, expression in exp1, expression in exp2)
AAA   8.305    5.506        

rna001_rpkm.txt:

(format: gene, transcript, total length of exons, number of reads, rpkm)
AAA     NM_XXX       9342    4653    3.17929

and rna002_rpkm.txt:

(format: gene, transcript, total length of exons, number of reads, rpkm)
AAA     NM_XXX        918     4188    0.339037        

We can then extract the gene expression values by specifying the columns where the keys (genes) and expression values are using:

./geneModel util -cmd merge-col -f output/exp_array.txt -f output/exp_array.txt -f output/rna001_rpkm.txt -f output/rna002_rpkm.txt \
                                   -key 0,0,0,0 -val 1,2,4,4 -o output/exp_RNAseq_array.txt
Additional option:
            -missing     what to print when a key does not exist in the input file (default= "")

The output will be something like:

 AAA  8.305    5.506   3.17929  0.339037

Full list of options:

          -cmd merge-columns    Merge multiple files while extracting corresponding keys (e.g. gene names)
                -f              Input files. You can use multiple instances of -f to specify your files. However, if
                                you have a large number of files, you can use something like -f "`ls myfiles*.txt`"
                                to have the OS expand the file list for you
                -names          Column names
                -coltype        Column name type if -name is not provided.
                                1. basename: use file name
                                2. dirname: use full directory name
                                3. simpledir: use current directory name
                                4. full: use full file name
                -o              Output file
                -key            Key indices (comma separated)
                -val            Value indices (comma separated)
                -miss           Output string for missing values (default= )
                --common        If specified, output only the lines that occur in ALL INPUT FILES
                --verbose       Verbose mode

Example 1:

  • This extracts the FPKM values for genes from Tophat output directories (output/expression/sample_names) and uses the sample names (directory names) as the column names in the summary output (summary_gene_FPKM.txt).
./geneModel util -cmd merge-col -f "`ls output/expression/*/genes.fpkm_tracking`" -coltype simpledir -key 0 -val 9 -o output/summary_gene_FPKM.txt

Convert numerical chr ids (1..24) to string chr ids (chr1..chrY) in the specified column of a file

./geneModel util -cmd convert-num -i file.txt -col 0
                -i              Input file
                -o              Output file
                -d              Delimiter
                -col            Column index (0-based)

If input file name is empty, the program reads in the data stream from STDIN.

Displaying all commands

./geneModel help all

Utility scripts

Generate SGE script files from a command file (one SGE file per command line)

To generate SGE scripts (e.g. qHiC2011_0.sh) from a file containing one command per line:

scripts/makeSGEjob_per_line.pl -input batchHic2011.sh -sge qHiC2011 -wall 5:00:00
 

To submit the SGE jobs, use:

for i in `ls qHiC2011_*.sh`; do qsub $i; done

Options for makeSGEjob_per_line.pl:

-input   input file containing command lines
-sge     output SGE script file base name
-wall    wall time
-inst    additional instructions for the SGE scheduler (e.g. -q to specify the platform)
--submit submit the jobs to the SGE immediately after generating the scripts
--array  generate scripts that run in ARRAY JOB MODE

NOTE:

  • makeSGEjob_per_line.pl takes input from STDIN so you can use it to directly start a job.

E.g. The following command starts a BLAT server that runs 36 hours:

echo "gfServer start localhost XXXX -stepSize=5 -log=untrans.log ~/databases/wg/hg18.fa.2bit " |
 ./makeSGEjob_per_line.pl -input - -sge qBLAT_server -mem 5g -wall 36:00:00 --log --submit
Personal tools