TissueInfo
TissueInfo
TissueInfo is a bioinformatics pipeline to calculate the tissue expression profile of transcripts, ESTs or proteins. More background information about the project can be found on the TissueInfo project page.
This wiki page is concerned with ongoing or future improvements to TissueInfo.
Calculating tissue expression profile similarity
Once TissueInfo has produced transcript expression profiles, how can we quantify the similarity between the expression profiles of two transcripts? A quantitative similarity measure would be useful to cluster transcripts based on their tissue expression, or simply to assess how compatible the tissue expression profiles of the transcripts in a gene list are. To illustrate this discussion, consider the three transcripts t1, t2 and t3, with the EST counts per tissue shown in Table 1 below.
| transcript id | hypothalamus | hippocampus | brain | liver |
|---|---|---|---|---|
| t1 | 0 | 1 | 10 | 232 |
| t2 | 1 | 3 | 50 | 25 |
| t3 | 0 | 0 | 40 | 100 |
Naive distance with EST counts
We could cluster transcripts based on the counts of ESTs in each tissue tested. The quantitative measure would be a distance calculated between points in n dimensions, if n tissues were tested. The coordinates of each transcript could simply be the number of times the transcript is detected in an EST library made from each tissue.
If we use a Euclidean distance,

$$d(t_i,t_j) = \sqrt{\sum_{t\,=\,\text{tissue}} \left(e(t_i,t) - e(t_j,t)\right)^2}$$

where $e(t_i,t)$ is the EST count of transcript $t_i$ in tissue $t$, then d(t1,t2) = 210.84 and d(t1,t3) = 135.37. These distances suggest that t1 is closer to t3 than to t2. Yet t1 and t2 both have matches in hippocampus, a tissue where expression is infrequently reported. Intuitively, this match should count more than expression in liver, which is common for many genes.
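The two distances quoted above can be checked with a minimal Python sketch. The names `est_counts` and `euclidean_distance` are illustrative only and are not part of TissueInfo; the counts are copied from Table 1.

```python
import math

# EST counts from Table 1, in tissue order:
# hypothalamus, hippocampus, brain, liver.
est_counts = {
    "t1": [0, 1, 10, 232],
    "t2": [1, 3, 50, 25],
    "t3": [0, 0, 40, 100],
}


def euclidean_distance(a, b):
    """Euclidean distance between two profiles covering the same tissues."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same tissues")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d12 = euclidean_distance(est_counts["t1"], est_counts["t2"])
d13 = euclidean_distance(est_counts["t1"], est_counts["t3"])
print(f"d(t1,t2) = {d12:.2f}")  # 210.84
print(f"d(t1,t3) = {d13:.2f}")  # 135.37
```

Almost all of both distances comes from the liver column, which is why the shared hippocampus expression of t1 and t2 has no visible effect.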
Expression confidence scores
We could transform the EST counts into measures of how much we trust expression in a given tissue. For instance, the EST counts of 1 and 3 in hippocampus carry the same information: expression was detected in hippocampus for both t1 and t2. The difference between 1 and 3 is within sampling error, and it would be unwise to conclude that t2 is expressed three times more than t1 in hippocampus. The counts of Table 1 then become the confidence scores shown below.
| transcript id | hypothalamus | hippocampus | brain | liver |
|---|---|---|---|---|
| t1 | 0 | 1 | 2 | 3 |
| t2 | 1 | 1 | 2 | 2 |
| t3 | 0 | 0 | 2 | 2 |
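This page does not state how EST counts are mapped to confidence scores. Purely as an illustration, the sketch below bins counts with hypothetical cutoffs (1 to 9 gives 1, 10 to 199 gives 2, 200 or more gives 3). These cutoffs are an assumption chosen only because they reproduce the table above; they are not TissueInfo's actual rule.

```python
from bisect import bisect_right

# ASSUMPTION: hypothetical cutoffs, not taken from TissueInfo.
# A count below 1 scores 0, 1-9 scores 1, 10-199 scores 2, >= 200 scores 3.
HYPOTHETICAL_CUTOFFS = [1, 10, 200]


def confidence_score(est_count):
    """Map an EST count to a coarse confidence score (0 to 3)."""
    if est_count < 0:
        raise ValueError("EST counts cannot be negative")
    # bisect_right returns how many cutoffs are <= est_count.
    return bisect_right(HYPOTHETICAL_CUTOFFS, est_count)


# EST counts from Table 1: hypothalamus, hippocampus, brain, liver.
est_counts = {
    "t1": [0, 1, 10, 232],
    "t2": [1, 3, 50, 25],
    "t3": [0, 0, 40, 100],
}

confidence = {
    transcript: [confidence_score(c) for c in counts]
    for transcript, counts in est_counts.items()
}

for transcript, scores in confidence.items():
    print(transcript, scores)
```

Any monotonic binning would serve the same purpose: counts that differ only by sampling noise, such as 1 and 3, collapse onto the same score.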
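The Euclidean distance calculated with confidence scores yields

$$d(t_1,t_3) = \sqrt{(0-0)^2 + (1-0)^2 + (2-2)^2 + (3-2)^2} = \sqrt{2} \approx 1.41$$

Similarly, d(t1,t2) = 1.41. With confidence scores, t2 and t3 are therefore equally distant from t1.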
Scoring with confidence scores
We could quantify expression profile similarity with an aggregate of confidence scores (inspired by sequence similarity scores). Here, we sum one contribution per tissue. The contribution is positive if the confidence scores of both transcripts are positive (we can take the minimum of the two confidence scores), as in s(t1,t2) = −1 + min(1,1) + min(2,2) + min(3,2) = −1 + 1 + 2 + 2 = 4. The contribution is zero when the confidence score is zero in both transcripts. Finally, when expression is detected in the tissue for one transcript but not for the other, the contribution is minus the confidence score of the transcript for which expression is detected.
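Formally,

$$s(t_i,t_j) = \sum_{t\,=\,\text{tissue}} S(t_i,t_j,t)$$

where, writing $c(t_i,t)$ for the confidence score of transcript $t_i$ in tissue $t$,

$$S(t_i,t_j,t) = \begin{cases} \min\left(c(t_i,t),\, c(t_j,t)\right) & \text{if } c(t_i,t) > 0 \text{ and } c(t_j,t) > 0 \\ 0 & \text{if } c(t_i,t) = c(t_j,t) = 0 \\ -\max\left(c(t_i,t),\, c(t_j,t)\right) & \text{otherwise} \end{cases}$$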
This formulation yields s(t1,t3) = 0 − 1 + min(2,2) + min(3,2) = −1 + 2 + 2 = 3. Since s(t1,t2) = 4 > s(t1,t3), the score suggests that t1 and t2 have closer expression profiles than t1 and t3.
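The scoring rule described above can be sketched in a few lines of Python. The function names `tissue_term` and `similarity` are illustrative and do not come from TissueInfo; the confidence scores are copied from the confidence score table.

```python
# Confidence scores in tissue order:
# hypothalamus, hippocampus, brain, liver.
confidence = {
    "t1": [0, 1, 2, 3],
    "t2": [1, 1, 2, 2],
    "t3": [0, 0, 2, 2],
}


def tissue_term(ci, cj):
    """Contribution of one tissue to the similarity score."""
    if ci > 0 and cj > 0:
        # Expressed in both transcripts: reward with the smaller score.
        return min(ci, cj)
    if ci == 0 and cj == 0:
        # Detected in neither transcript: no evidence either way.
        return 0
    # Detected in only one transcript: penalize by that transcript's score.
    return -max(ci, cj)


def similarity(profile_i, profile_j):
    """Sum the per-tissue contributions of two confidence profiles."""
    if len(profile_i) != len(profile_j):
        raise ValueError("profiles must cover the same tissues")
    return sum(tissue_term(ci, cj) for ci, cj in zip(profile_i, profile_j))


s12 = similarity(confidence["t1"], confidence["t2"])
s13 = similarity(confidence["t1"], confidence["t3"])
print("s(t1,t2) =", s12)  # 4
print("s(t1,t3) =", s13)  # 3
```

Unlike a distance, a higher value here means more similar profiles, and the score is symmetric in its two arguments.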
More information
Some data regarding the test of this idea are presented on the TEPSS page.
We are currently preparing a manuscript for publication about this extension of TissueInfo. Contact me if you would like to try a pre-release version of TissueInfo. Fabien Campagne 11:30, 8 August 2007 (EDT)
Future developments
This page lists ideas for future development of TissueInfo.
