TEPSS
Background
TEPSS is a new TissueInfo development that scans genomes for transcripts whose expression is similar to that of a query transcript. In a nutshell, it does for tissue expression profiles what Smith-Waterman or BLAST searches do for sequences. If you use the TEPSS method, please acknowledge it by citing the following publication, which also describes the method in detail.
Beyond tissueInfo: functional prediction using tissue expression profile similarity searches. Aguilar D, Skrabanek L, Gross SS, Oliva B, Campagne F. Nucleic Acids Res. 2008 Jun;36(11):3728-37. Epub 2008 May 15.
Software Requirements
- Java Runtime 1.6+
- Fastutil Fast and compact type-specific collections for Java
- JSAP Java-based Simple Argument Parser
- MG4J Managing Gigabytes for Java
- DSI Utilities
- Commons Logging
- log4j
- Commons Configuration
- Commons Lang
- Commons Math
- Commons CLI
- Commons I/O
- Commons Collections
- Colt Open Source Libraries for High Performance Scientific and Technical Computing in Java
- JGAP Java Genetic Algorithms Package
- PJ Parallel Java Library
- Crover CLI
- QtClustering Library
- Apache Ant and Ant-Contrib (if building library from source code)
Getting the library
The TEPSS package is distributed as precompiled jar files and also in source code form. The distribution types are described in the following sections.
Binary Distribution
The binary distribution of the TEPSS package contains two jar files:
- tisim.jar
- includes all the external classes needed to run
- tissueinfo.jar
- includes all the project classes and will require external libraries for use in your own projects
Count Information Distribution
The count information required to run TEPSS can be downloaded using the following links:
This information is derived from dbEST using TissueInfo.
Source Distribution
The source distribution of the TEPSS package contains the Java source code along with supporting files that are used to compile and test the package. See this page to download the source code.
Building
Note that this section is meant only for those with the source distribution or subversion access. Users of the binary distribution should skip this section.
Compiling and packaging
The Ant target used to build the TEPSS package is called "jar". Executing ant jar will produce the files "tisim.jar" and "tissueinfo.jar" in <install-dir>.
Subversion Access from the ICB local environment
This project's Subversion repository can be checked out with the following command:
svn co https://pbtech-vc.med.cornell.edu/public/svn/icb/trunk/tissueinfo
Additionally you can browse the tissueinfo project source code in the Subversion repository.
NOTE: Use username "guest" and your email address at the login prompt if you do not have an account with the ICB.
TiSimilarity
Given sequence accession codes, TiSimilarity produces a ranked list of sequences with similar tissue expression profiles.
Example
As an example, consider the task of finding the human transcripts whose expression most resembles that of APP (Amyloid Precursor Protein). TiSimilarity (tisim for short) can be used as follows:
java -jar tisim.jar --mode single -q ENST00000346798 -b Homo_sapiens.NCBI36.46.masked.counts-data -k 20 -s edu.cornell.med.icb.tissueinfo.similarity.scorers.ExpressionConfidenceScorer
(Alternatively, you can use the tisim ant target in the ant build file <tissueinfo>/sim/run.xml. You will be prompted for a comma delimited list of transcript accession codes.)
tisim then searches the compressed genome representation constructed with TissueInfo. A detailed description of the expected output can be found here.
Program Parameters
This section describes the various options that can be given to the TiSimilarity program.
Global Options
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-b \| --basename) | basename | yes | The base name (prefix) of the compressed count data. For example, the basename parameter for the human count data provided above should be Homo_sapiens.NCBI36.46.masked.counts-data. Similarly, the basename parameter when using the mouse count data provided above should be Mus_musculus.NCBIM36.46.masked.counts-data. |
| (-m \| --mode) | mode | yes | Mode of execution, one of: single, backward-screen, pairs-scores-only, loo-forward, pairs, list, transcript-list, tally-scores, group-scoring, forward-screen |
| (-k \| --k-best-targets) | k (default: -1) | no | Number of top scoring sequences to report. |
| (-f \| --genome-filter) | genome-filter | no | File with transcript ids to be considered part of the genome. If provided, transcripts not in this file are ignored for scoring. This option affects the rank of results. |
| (-s \| --scorer) | scorerClass | no | Name of a scorer. Must be the fully qualified classname of an implementation of edu.cornell.med.icb.tissueinfo.similarity.scorers.ExpressionProfileSimilarityScorer. Please see the TEPSS manuscript for a list of scorers and their relative performance. |
| (-y \| --rank-by-score) | n/a | no | When present, ranks the screen by sum of score; otherwise, ranks by sum of inverse rank. |
| (-c \| --count-preprocessor-chain) | class1, class2, ... | no | A comma delimited list of class names. Each class must implement the edu.cornell.med.icb.tissueinfo.similarity.preprocessors.CountPreprocessor interface and will be used in sequence to preprocess tissue count information before scoring transcripts. Implementations are available for shuffling counts or precalculating confidence scores. |
| (-o \| --output) | filename | no | Name of the output file. If not specified, output will be to the console. |
| (-h \| --help) | n/a | no | Print help message |
Mode Options
single
This mode is used to search for accession codes against the rest of the transcriptome. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.SimpleSearchMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-q \| --acs) | ac1, ac2, ... | yes | A comma separated list of accession codes |
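As a minimal sketch, several accession codes can be searched in one run by passing them to -q as a comma separated list. The script below assumes that tisim.jar and the human count data sit in the current directory. The second accession code and the output file name are hypothetical placeholders, and the command is only executed when the jar is found.

```shell
# Comma separated query list: the first code is the APP transcript used
# in the example above, the second is a hypothetical placeholder.
queries="ENST00000346798,ENST00000000001"

# Assemble the command from flags documented on this page.
# single-results.txt is a placeholder output file name.
set -- java -jar tisim.jar --mode single \
    -q "$queries" \
    -b Homo_sapiens.NCBI36.46.masked.counts-data \
    -k 20 -o single-results.txt
tisim_cmd="$*"
echo "$tisim_cmd"

# Run only when the jar is actually available.
if [ -f tisim.jar ]; then "$@"; fi
```

Printing the assembled command before running it makes it easy to check the flags against the tables on this page.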
backward-screen
This mode is implemented by edu.cornell.med.icb.tissueinfo.similarity.BackwardMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-i \| --input) | filename | yes | Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which will search with all transcripts |
| (-r \| --restrict) | filename | yes | Name of restrict file. A restrict file contains a list of sequence identifiers. When the restrict option is active, output is reported only for the sequence identifiers provided. The search is still performed against the full genome, but instead of reporting each matching target sequence, the output reports only on identifiers in the restrict list. The -k option is ignored when the restrict option is active. |
| (-w \| --weights) | n/a | no | Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring. |
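The following sketch shows one way to prepare the two input files for a backward-screen run and assemble the command. The identifiers in the restrict file are hypothetical placeholders, and writing them one per line is an assumption, since this page only states that the file contains a list of sequence identifiers. The output file name is also a placeholder. No -k is passed because it is ignored when the restrict option is active.

```shell
# Scratch directory for the example input files.
workdir=$(mktemp -d)

# Query list: one accession code per line.
printf '%s\n' ENST00000346798 > "$workdir/queries.txt"

# Restrict list: identifiers to report on (hypothetical placeholders,
# one per line by assumption).
printf '%s\n' TRANSCRIPT_X TRANSCRIPT_Y > "$workdir/restrict.txt"

# -w weights tissues using counts from the restrict list.
set -- java -jar tisim.jar --mode backward-screen \
    -b Homo_sapiens.NCBI36.46.masked.counts-data \
    -i "$workdir/queries.txt" -r "$workdir/restrict.txt" -w \
    -o "$workdir/backward-screen.txt"
echo "$*"

# Run only when the jar is actually available.
if [ -f tisim.jar ]; then "$@"; fi
```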
pairs-scores-only
This mode evaluates scores, but not ranks, for pairs of transcripts. The performance of this mode is linear in the number of pairs. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.PairScoreOnlyMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-p \| --pair-list) | filename | yes | The name of a file containing a list of accession code pairs, one pair per line. |
loo-forward
This mode is implemented by edu.cornell.med.icb.tissueinfo.similarity.LeaveOneOutMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-i \| --input) | filename | yes | Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which will search with all transcripts |
| (-g \| --gene-transcript-relationships) | filename | yes | A file with a gene transcript pair per line. First field is gene id, second field is transcript id (tab delimited). Each line indicates that the transcript is the product of the gene. This information is used to leave out all the transcripts corresponding to a gene. |
| (-w \| --weights) | n/a | no | Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring. |
| (-e \| --extra-input-set) | filename | no | Extra transcripts for scoring. Transcripts in this set are never left out. This file must contain a list of transcript IDs (one per line). |
| --use-rocr | n/a | no | Use ROCR to calculate AUC and plot a ROC curve. The curve is plotted in PDF format with ROCR R package. A connection to R must be possible through Rserve for this option to work. Additionally, the Rserve process must be able to write to the directory where the file is written (generally the current working directory unless specified by the rocr-plotfile option). The file where the ROC curve is printed is listed in the statistics file. |
| --rocr-plotfile | filename | no | Path to the name of a pdf file you wish the ROC curve to be written to. This option is ignored if the --use-rocr option is not specified. Specifying this file is especially useful when the Rserve process is running as another user. |
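Because the -g file drives which transcripts are left out, it can be worth checking its layout before a long run. The sketch below writes a small tab delimited gene/transcript file, counts malformed lines with awk, and then assembles a loo-forward command. All gene and transcript identifiers are hypothetical placeholders, and the validation step is a suggestion rather than part of TEPSS.

```shell
workdir=$(mktemp -d)
map="$workdir/gene-transcripts.tsv"

# gene id <TAB> transcript id, one pair per line (placeholder identifiers).
printf 'GENE_A\tTRANSCRIPT_A1\n'  > "$map"
printf 'GENE_A\tTRANSCRIPT_A2\n' >> "$map"
printf 'GENE_B\tTRANSCRIPT_B1\n' >> "$map"

# Count lines that do not have exactly two non-empty tab separated fields.
bad_lines=$(awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" { n++ } END { print n + 0 }' "$map")
echo "malformed lines: $bad_lines"

# Transcripts per gene: all transcripts of a gene are left out together.
per_gene=$(awk -F '\t' '{ count[$1]++ } END { for (g in count) print g, count[g] }' "$map" | sort)
echo "$per_gene"

# Query list, one accession code per line.
printf '%s\n' TRANSCRIPT_A1 TRANSCRIPT_B1 > "$workdir/queries.txt"

set -- java -jar tisim.jar --mode loo-forward \
    -b Homo_sapiens.NCBI36.46.masked.counts-data \
    -i "$workdir/queries.txt" -g "$map"
echo "$*"

# Run only when the mapping is clean and the jar is available.
if [ "$bad_lines" -eq 0 ] && [ -f tisim.jar ]; then "$@"; fi
```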
cv-forward
This mode is implemented by edu.cornell.med.icb.tissueinfo.similarity.CrossValidationMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-i \| --input) | filename | yes | Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which will search with all transcripts |
| (-g \| --gene-transcript-relationships) | filename | yes | A file with a gene transcript pair per line. First field is gene id, second field is transcript id (tab delimited). Each line indicates that the transcript is the product of the gene. This information is used to leave out all the transcripts corresponding to a gene. |
| (-w \| --weights) | n/a | no | Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring. |
| (-e \| --extra-input-set) | filename | no | Extra transcripts for scoring. Transcripts in this set are never left out. This file must contain a list of transcript IDs (one per line). |
| (-x \| --folds) | folds | yes | Number of cross validation folds. |
| (-r \| --seed) | seed | yes | Number to use as seed for random number generator. |
pairs
This mode evaluates scores and ranks for pairs of transcripts. Evaluating rank requires the comparison of each transcript of a pair to the whole genome. The complexity of this mode is proportional to the number of transcripts in the genome and to the number of pairs. Use pairs-scores-only if the rank is not required. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.PairSearchMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-p \| --pair-list) | filename | yes | The name of a file containing a list of accession code pairs, one pair per line. |
list
This mode is used to scan a list of accession codes. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.ListSearchMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-i \| --input) | filename | yes | Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which will search with all transcripts |
| (-w \| --weights) | n/a | no | Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring. |
transcript-list
This mode is implemented by edu.cornell.med.icb.tissueinfo.similarity.TranscriptListMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-g \| --gene-to-transcripts-file) | filename | yes | Name of the file containing the association between Gene IDs and Transcript IDs (tab delimited format). |
| (-t \| --type-of-transcripts) | type | yes (1, 2) | Type of transcripts (Values: all, nonzero, zero). |
| (-i \| --input) | filename | yes (1) | Name of file containing a list of accession codes (either Ensembl Gene IDs or Ensembl Transcript IDs). |
| (-n \| --number-of-transcripts-per-gene) | n | no | Limit transcript output to n per gene. (Value: int). (default: no limit) |
| (-r \| --random-sample) | sample | no (3) | Random sample of transcripts (Value: int). (default: 0) |
| (-a \| --tabulated-output) | n/a | no (2) | Tabulated output in the form "query ID<tab>Target ID" |

(1) Exactly one of options -i and -t is required; they are mutually exclusive and cannot be used together.
(2) Options -t and -r cannot be used along with option -a.
(3) When option -n is specified, the random sample is taken after the number of transcripts has been limited.
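The flag constraints in the footnotes can be checked before launching a run. The function below is a hypothetical helper, not part of TEPSS: it looks only at which of -i, -t, -r and -a are present and encodes footnotes 1 and 2 as read from this page. It does not check that the required -g option is given.

```shell
# Returns 0 when the flag combination is allowed by footnotes 1 and 2,
# and 1 otherwise. Pass the flags only, e.g.:
#   check_transcript_list_flags -i -a
check_transcript_list_flags() {
    has_i=0 has_t=0 has_r=0 has_a=0
    for flag in "$@"; do
        case "$flag" in
            -i) has_i=1 ;;
            -t) has_t=1 ;;
            -r) has_r=1 ;;
            -a) has_a=1 ;;
        esac
    done
    # Footnote 1: exactly one of -i and -t must be given.
    if [ $((has_i + has_t)) -ne 1 ]; then return 1; fi
    # Footnote 2: -a cannot be combined with -t or -r.
    if [ "$has_a" -eq 1 ] && [ $((has_t + has_r)) -gt 0 ]; then return 1; fi
    return 0
}

# Try a few combinations; word splitting of $combo is intentional.
for combo in "-i -a" "-t -r" "-i -t" "-t -a"; do
    if check_transcript_list_flags $combo; then
        printf '%s: allowed\n' "$combo"
    else
        printf '%s: rejected\n' "$combo"
    fi
done
```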
tally-scores
This mode performs a genome x genome search and tallies the number of scores that fall into configurable score bins. It is implemented by edu.cornell.med.icb.tissueinfo.similarity.TallyScoreSearchMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-x \| --num-bins) | num-bins | no | The number of score bins to use for tally. (default: 100) |
| --score-span | score-span | no | The maximum span of score values that will be recorded in the bins. (default: 2000) |
group-scoring
This mode is implemented by edu.cornell.med.icb.tissueinfo.similarity.GroupScoringMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-i \| --input) | filename | yes | Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which will search with all transcripts |
| (-w \| --weights) | n/a | no | Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring. |
forward-screen
This mode is implemented by edu.cornell.med.icb.tissueinfo.similarity.ForwardMode.
| Flag | Arguments | Required | Description |
|---|---|---|---|
| (-i \| --input) | filename | yes | Name of a file containing a list of query sequence accession codes, one per line, or the name "all", which will search with all transcripts |
| (-w \| --weights) | n/a | no | Weight tissues using counts in restrict list. When this option is active, tissues from the restrict list are used to calculate tissue weights. Weights are then used during scoring. |
