TEPSS

Background

TEPSS is a new TissueInfo development that scans genomes for transcripts whose expression is similar to that of a query transcript. In a nutshell, it is like a Smith-Waterman/BLAST search, but for tissue expression profiles. Usage of the TEPSS method should be acknowledged by citing the following publication. Please consult this article for details about the method.

Beyond tissueInfo: functional prediction using tissue expression profile similarity searches. Aguilar D, Skrabanek L, Gross SS, Oliva B, Campagne F. Nucleic Acids Res. 2008 Jun;36(11):3728-37. Epub 2008 May 15.

Software Requirements

Getting the library

The TEPSS package is distributed as precompiled jar files and also in source code form. The distribution types are described in the following sections.

Binary Distribution

The binary distribution of the TEPSS package contains two jar files, described below:

tisim.jar
includes all the external classes needed to run
tissueinfo.jar
includes all the project classes and will require external libraries for use in your own projects

Count Information Distribution

The count information required to run TEPSS can be downloaded using the following links:

This information is derived from dbEST using TissueInfo.

Source Distribution

The source distribution of the TEPSS package contains the Java source code along with supporting files that are used to compile and test the package. See this page to download the source code.

Building

Note that this section is meant only for those with the source distribution or subversion access. Users of the binary distribution should skip this section.

Compiling and packaging

The Ant target used to build the TEPSS package is called "jar". Executing ant jar will produce two files, "tisim.jar" and "tissueinfo.jar", in the <install-dir>.

Subversion Access from the ICB local environment

This project's Subversion repository can be checked out through SVN with the following instruction set:

 svn co https://pbtech-vc.med.cornell.edu/public/svn/icb/trunk/tissueinfo

Additionally you can browse the tissueinfo project source code in the Subversion repository.

NOTE: Use username "guest" and your email address at the login prompt if you do not have an account with the ICB.

TiSimilarity

Given sequence accession codes, TiSimilarity produces a ranked list of sequences with similar tissue expression profiles.

Example

As an example, consider the task of finding the human transcripts whose expression most resembles that of APP (Amyloid Precursor Protein). TiSimilarity (tisim for short) can be used as follows:

 java -jar tisim.jar --mode single -q ENST00000346798 -b Homo_sapiens.NCBI36.46.masked.counts-data -k 20 \
   -s edu.cornell.med.icb.tissueinfo.similarity.scorers.ExpressionConfidenceScorer

(Alternatively, you can use the tisim ant target in the ant build file <tissueinfo>/sim/run.xml. You will be prompted for a comma-delimited list of transcript accession codes.)

tisim then searches the compressed genome representation constructed with TissueInfo. A detailed description of the expected output can be found here.

Program Parameters

This section describes the various options that can be given to the TiSimilarity program.

Global Options

Flag	Arguments	Required	Description
(-b | --basename)	basename	yes	The base name (prefix) of the compressed count data. For example, the basename parameter for the human count data provided above should be Homo_sapiens.NCBI36.46.masked.counts-data. Similarly, the basename parameter when using the mouse count data provided above should be Mus_musculus.NCBIM36.46.masked.counts-data.
(-m | --mode)	mode	yes	Mode of execution, one of: single, backward-screen, pairs-scores-only, loo-forward, pairs, list, transcript-list, tally-scores, group-scoring, forward-screen.
(-k | --k-best-targets)	k (default: -1)	no	Number of top scoring sequences to report.
(-f | --genome-filter)	genome-filter	no	File with transcript ids to be considered part of the genome. If provided, transcripts not in this file are ignored for scoring. This option affects the rank of results.
(-s | --scorer)	scorerClass	no	Name of a scorer. Must be the fully qualified class name of an implementation of edu.cornell.med.icb.tissueinfo.similarity.scorers.ExpressionProfileSimilarityScorer. Please see the TEPSS manuscript for a list of scorers and their relative performance.
(-y | --rank-by-score)	n/a	no	When present, ranks the screen by sum of scores; otherwise, ranks by sum of inverse ranks.
(-c | --count-preprocessor-chain)	class1, class2, ...	no	A comma-delimited list of class names. Each class must implement the edu.cornell.med.icb.tissueinfo.similarity.preprocessors.CountPreprocessor interface and will be used in sequence to preprocess tissue count information before scoring transcripts. Implementations are available for shuffling counts or precalculating confidence scores.
(-o | --output)	filename	no	Name of the output file. If not specified, output will be to the console.
(-h | --help)	n/a	no	Print help message.
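
As an illustration of how the global options combine, the following sketch repeats the APP search but restricts the genome with -f and writes the results to a file with -o. The file names genome-filter.txt and app-results.txt and the TRANSCRIPT_ID_n entries are placeholders invented for this example, and the one-ID-per-line layout of the filter file is an assumption (the format is not specified here). The search only runs if tisim.jar is in the current directory.

```shell
# Hypothetical genome filter: transcript IDs to treat as the genome.
# One ID per line is an assumption; IDs other than the query are placeholders.
printf '%s\n' ENST00000346798 TRANSCRIPT_ID_2 TRANSCRIPT_ID_3 > genome-filter.txt

BASENAME=Homo_sapiens.NCBI36.46.masked.counts-data

# Run the search only when the jar is available in the current directory.
if [ -f tisim.jar ]; then
    java -jar tisim.jar --mode single -q ENST00000346798 \
        -b "$BASENAME" -k 20 \
        -f genome-filter.txt \
        -o app-results.txt
else
    echo "tisim.jar not found; skipping the search" >&2
fi
```

Because -f changes which transcripts are scored, ranks obtained with and without the filter are not directly comparable.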

Mode Options

single

This mode is used to search for accession codes against the rest of the transcriptome. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.SimpleSearchMode.

Flag	Arguments	Required	Description
(-q | --acs)	ac1, ac2, ...	yes	A comma-separated list of accession codes.

backward-screen

It is implemented by edu.cornell.med.icb.tissueinfo.similarity.BackwardMode.

Flag	Arguments	Required	Description
(-i | --input)	filename	yes	Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which searches with all transcripts.
(-r | --restrict)	filename	yes	Name of restrict file. A restrict file contains a list of sequence identifiers. When the restrict option is active, output is reported only for the sequence identifiers provided. The search is still performed against the full genome, but instead of reporting each matching target sequence, the output reports only on identifiers in the restrict list. The -k option is ignored when the restrict option is active.
(-w | --weights)	n/a	no	Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring.
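
A backward screen could be set up roughly as follows. This is a sketch rather than a tested recipe: queries.txt, restrict.txt and backward-screen.txt are hypothetical file names, the TRANSCRIPT_ID_n entries are placeholders, and the one-identifier-per-line layout of the restrict file is an assumption.

```shell
# Hypothetical query list: one accession code per line.
# Only ENST00000346798 is a real ID (the APP transcript used earlier).
printf '%s\n' ENST00000346798 TRANSCRIPT_ID_2 > queries.txt

# Hypothetical restrict list: output is reported only for these identifiers.
printf '%s\n' TRANSCRIPT_ID_3 TRANSCRIPT_ID_4 > restrict.txt

# Run the screen only when the jar is available in the current directory.
if [ -f tisim.jar ]; then
    # -w asks for tissue weights derived from the restrict list.
    java -jar tisim.jar --mode backward-screen \
        -b Homo_sapiens.NCBI36.46.masked.counts-data \
        -i queries.txt -r restrict.txt -w \
        -o backward-screen.txt
else
    echo "tisim.jar not found; skipping the screen" >&2
fi
```

No -k is passed here because that option is ignored when the restrict option is active.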

pairs-scores-only

This mode evaluates scores, but not ranks, for pairs of transcripts. The performance of this mode is linear in the number of pairs. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.PairScoreOnlyMode.

Flag	Arguments	Required	Description
(-p | --pair-list)	filename	yes	The name of a file containing a list of accession code pairs, one pair per line.

loo-forward

It is implemented by edu.cornell.med.icb.tissueinfo.similarity.LeaveOneOutMode.

Flag	Arguments	Required	Description
(-i | --input)	filename	yes	Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which searches with all transcripts.
(-g | --gene-transcript-relationships)	filename	yes	A file with one gene/transcript pair per line. The first field is the gene id, the second field is the transcript id (tab delimited). Each line indicates that the transcript is the product of the gene. This information is used to leave out all the transcripts corresponding to a gene.
(-w | --weights)	n/a	no	Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring.
(-e | --extra-input-set)	filename	no	Extra transcripts for scoring. Transcripts in this set are never left out. This file must contain a list of transcript IDs (one per line).
--use-rocr	n/a	no	Use ROCR to calculate AUC and plot a ROC curve. The curve is plotted in PDF format with the ROCR R package. A connection to R must be possible through Rserve for this option to work. Additionally, the Rserve process must be able to write to the directory where the file is written (generally the current working directory unless specified by the --rocr-plotfile option). The file where the ROC curve is printed is listed in the statistics file.
--rocr-plotfile	filename	no	Path of the PDF file to which the ROC curve should be written. This option is ignored if the --use-rocr option is not specified. Specifying this file is especially useful when the Rserve process is running as another user.
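
The gene/transcript file expected by -g can be sketched as follows. All identifiers (GENE_A, TRANSCRIPT_A1, and so on) and file names are placeholders chosen for illustration; only the tab-delimited gene-then-transcript layout comes from the option description. The awk step merely illustrates which transcripts belong to the same gene as a query, and is not part of TEPSS.

```shell
# Hypothetical gene/transcript relationships: gene id, a tab, then transcript id.
printf 'GENE_A\tTRANSCRIPT_A1\n'  > gene-transcripts.tsv
printf 'GENE_A\tTRANSCRIPT_A2\n' >> gene-transcripts.tsv
printf 'GENE_B\tTRANSCRIPT_B1\n' >> gene-transcripts.tsv

# Hypothetical query list, one accession code per line.
printf '%s\n' TRANSCRIPT_A1 TRANSCRIPT_B1 > loo-queries.txt

# Look up the gene of the query TRANSCRIPT_A1, then list every transcript of
# that gene; per the -g description, these are the ones left out together.
QUERY_GENE=$(awk -F '\t' '$2 == "TRANSCRIPT_A1" { print $1 }' gene-transcripts.tsv)
awk -F '\t' -v gene="$QUERY_GENE" '$1 == gene { print $2 }' gene-transcripts.tsv

# Run the leave-one-out search only when the jar is available.
if [ -f tisim.jar ]; then
    java -jar tisim.jar --mode loo-forward \
        -b Homo_sapiens.NCBI36.46.masked.counts-data \
        -i loo-queries.txt -g gene-transcripts.tsv \
        -o loo-forward.txt
else
    echo "tisim.jar not found; skipping the run" >&2
fi
```

Adding --use-rocr to the command additionally requires a reachable Rserve process, as described in the table.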

cv-forward

It is implemented by edu.cornell.med.icb.tissueinfo.similarity.CrossValidationMode.

Flag	Arguments	Required	Description
(-i | --input)	filename	yes	Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which searches with all transcripts.
(-g | --gene-transcript-relationships)	filename	yes	A file with one gene/transcript pair per line. The first field is the gene id, the second field is the transcript id (tab delimited). Each line indicates that the transcript is the product of the gene. This information is used to leave out all the transcripts corresponding to a gene.
(-w | --weights)	n/a	no	Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring.
(-e | --extra-input-set)	filename	no	Extra transcripts for scoring. Transcripts in this set are never left out. This file must contain a list of transcript IDs (one per line).
(-x | --folds)	folds	yes	Number of cross validation folds.
(-r | --seed)	seed	yes	Number to use as seed for the random number generator.

pairs

This mode evaluates scores and ranks for pairs of transcripts. Evaluating rank requires the comparison of each transcript of a pair to the whole genome. The complexity of this mode is proportional to the number of transcripts in the genome, and to the number of pairs. Use pairs-scores-only if the rank is not required. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.PairSearchMode.

Flag	Arguments	Required	Description
(-p | --pair-list)	filename	yes	The name of a file containing a list of accession code pairs, one pair per line.

list

This mode is used to scan a list of accession codes. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.ListSearchMode.

Flag	Arguments	Required	Description
(-i | --input)	filename	yes	Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which searches with all transcripts.
(-w | --weights)	n/a	no	Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring.

transcript-list

It is implemented by edu.cornell.med.icb.tissueinfo.similarity.TranscriptListMode.

Flag	Arguments	Required	Description
(-g | --gene-to-transcripts-file)	filename	yes	Name of the file containing the association between Gene IDs and Transcript IDs (tab delimited format).
(-t | --type-of-transcripts)	type	yes [1][2]	Type of transcripts (Values: all, nonzero, zero).
(-i | --input)	filename	yes [1]	Name of file containing a list of accession codes (either Ensembl Gene IDs or Ensembl Transcript IDs).
(-n | --number-of-transcripts-per-gene)	n	no	Limit transcript output to n per gene. (Value: int). (default: no limit)
(-r | --random-sample)	sample	no [3]	Random sample of transcripts (Value: int). (default: 0)
(-a | --tabulated-output)	n/a	no [2]	Tabulated output in the form "query ID<tab>Target ID".

[1] Exactly one of options -i and -t is required; they are mutually exclusive and cannot be used together.

[2] Options -t and -r cannot be used along with option -a.

[3] When option -n is specified, the random sample is taken after the number of transcripts per gene has been limited.
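
One combination of these options that respects the constraints in the notes is sketched below. The file names and the GENE_x / TRANSCRIPT_x identifiers are placeholders. The gene-first column order of the -g file and the one-ID-per-line layout of the -i file are assumptions, since the table only states that the former is tab delimited.

```shell
# Hypothetical gene-to-transcript associations (tab delimited).
# Gene id first, transcript id second is an assumption.
printf 'GENE_A\tTRANSCRIPT_A1\nGENE_A\tTRANSCRIPT_A2\nGENE_B\tTRANSCRIPT_B1\n' \
    > gene-to-transcripts.tsv

# Hypothetical input: gene IDs, one per line (layout assumed).
printf '%s\n' GENE_A GENE_B > genes.txt

if [ -f tisim.jar ]; then
    # -i is given instead of -t (the two are mutually exclusive), -n 1 keeps
    # at most one transcript per gene, and -a is permitted because neither
    # -t nor -r is used.
    java -jar tisim.jar --mode transcript-list \
        -b Homo_sapiens.NCBI36.46.masked.counts-data \
        -g gene-to-transcripts.tsv -i genes.txt -n 1 -a \
        -o transcript-list.txt
else
    echo "tisim.jar not found; skipping the run" >&2
fi
```

Replacing -i genes.txt with -t nonzero would select transcripts by type instead, in which case -a would have to be dropped.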

tally-scores

This mode performs a genome x genome search and tallies the number of scores that fall into configurable score bins. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.TallyScoreSearchMode.

Flag	Arguments	Required	Description
(-x | --num-bins)	num-bins	no	The number of score bins to use for tally. (default: 100)
--score-span	score-span	no	The maximum span of score values that will be recorded in the bins. (default: 2000)

group-scoring

It is implemented by edu.cornell.med.icb.tissueinfo.similarity.GroupScoringMode.

Flag	Arguments	Required	Description
(-i | --input)	filename	yes	Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which searches with all transcripts.
(-w | --weights)	n/a	no	Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring.

forward-screen

It is implemented by edu.cornell.med.icb.tissueinfo.similarity.ForwardMode.

Flag	Arguments	Required	Description
(-i | --input)	filename	yes	Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which searches with all transcripts.
(-w | --weights)	n/a	no	Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring.