# TissueInfo

TissueInfo is a bioinformatics pipeline to calculate the tissue expression profile of transcripts, ESTs or proteins. More background information about the project can be found on the TissueInfo project page.

This wiki page is concerned with ongoing or future improvements to TissueInfo.

### Calculating tissue expression profile similarity

Once TissueInfo has produced transcript expression profiles, how can we quantify the similarity between the expression profiles of two transcripts? A quantitative similarity measure would be useful to cluster transcripts based on their tissue expression, or simply to assess how compatible the tissue expression profiles of the transcripts in a gene list are. To illustrate this discussion, consider the three transcripts t1, t2 and t3, expressed as shown in Table 1 below.

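Table 1. Sample expression profiles. Numbers are counts of ESTs for that transcript in the given tissue.

| transcript id | hypothalamus | hippocampus | brain | liver |
|---------------|--------------|-------------|-------|-------|
| t1            | 0            | 1           | 10    | 232   |
| t2            | 1            | 3           | 50    | 25    |
| t3            | 0            | 0           | 40    | 100   |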

#### Naive distance with EST counts

We could cluster transcripts based on the counts of ESTs in each tissue tested. The quantitative measure would be a distance calculated between points in n dimensions, if n tissues were tested. Each coordinate of a transcript could simply be the number of times the transcript is detected in an EST library made from the corresponding tissue.

With a Euclidean distance, $d(t_1,t_2)= \sqrt{(0-1)^2+(1-3)^2+(10-50)^2+(232-25)^2}$. This gives d(t1,t2) = 210.84 and d(t1,t3) = 135.37. These distances suggest that t1 is closer to t3 than to t2. Yet t1 and t2 both have matches in hippocampus, a tissue where expression is infrequently reported. Intuitively, this match should count more than expression in liver, which is common for many genes.
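The two distances quoted above can be checked with a few lines of Python. This is only a sketch: the dictionary `est_counts` and the helper `euclidean_distance` are names introduced here for illustration and are not part of TissueInfo.

```python
import math

# EST counts from Table 1, in tissue order:
# hypothalamus, hippocampus, brain, liver
est_counts = {
    "t1": (0, 1, 10, 232),
    "t2": (1, 3, 50, 25),
    "t3": (0, 0, 40, 100),
}


def euclidean_distance(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d12 = euclidean_distance(est_counts["t1"], est_counts["t2"])
d13 = euclidean_distance(est_counts["t1"], est_counts["t3"])

print(f"d(t1,t2) = {d12:.2f}")  # prints d(t1,t2) = 210.84
print(f"d(t1,t3) = {d13:.2f}")  # prints d(t1,t3) = 135.37
```

The liver counts dominate both sums, which is why the shared hippocampus signal has almost no influence on the result.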

#### Expression confidence scores

We could transform the EST counts into measures of how much we trust expression in a given tissue. For instance, EST counts of 1 and 3 in hippocampus carry the same information: expression was detected in hippocampus for both t1 and t2. The difference between 1 and 3 is within sampling error, and it would be unwise to conclude that t2 is expressed three times more than t1 in hippocampus.

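Table 2. Confidence scores for data in Table 1.

| transcript id | hypothalamus | hippocampus | brain | liver |
|---------------|--------------|-------------|-------|-------|
| t1            | 0            | 1           | 2     | 3     |
| t2            | 1            | 1           | 2     | 2     |
| t3            | 0            | 0           | 2     | 2     |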

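The Euclidean distance calculated with confidence scores yields $d(t_1,t_2)= \sqrt{(0-1)^2+(1-1)^2+(2-2)^2+(3-2)^2}=1.41$. Similarly, $d(t_1,t_3)= \sqrt{(0-0)^2+(1-0)^2+(2-2)^2+(3-2)^2}=1.41$. With confidence scores, the Euclidean distance no longer places t3 closer to t1 than t2, but it does not separate the two either.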
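As a quick check, the same Euclidean distance can be applied to the confidence scores of Table 2, comparing t1 with each of t2 and t3. The names `confidence` and `euclidean_distance` below are illustrative assumptions, not TissueInfo identifiers.

```python
import math

# Confidence scores from Table 2, in tissue order:
# hypothalamus, hippocampus, brain, liver
confidence = {
    "t1": (0, 1, 2, 3),
    "t2": (1, 1, 2, 2),
    "t3": (0, 0, 2, 2),
}


def euclidean_distance(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d12 = euclidean_distance(confidence["t1"], confidence["t2"])
d13 = euclidean_distance(confidence["t1"], confidence["t3"])

print(f"d(t1,t2) = {d12:.2f}")  # prints d(t1,t2) = 1.41
print(f"d(t1,t3) = {d13:.2f}")  # prints d(t1,t3) = 1.41
```

Both distances equal the square root of 2, so this measure treats t2 and t3 as equally distant from t1.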

#### Scoring with confidence scores

We could quantify expression profile similarity with an aggregate of confidence scores (inspired by sequence similarity scores). Here, we sum one contribution per tissue. The contribution is positive if the confidence scores of both transcripts are positive (we take the minimum of the two confidence scores). The contribution is zero when the confidence score is zero in both transcripts. Finally, when expression is detected in the tissue for only one of the two transcripts, the contribution is minus the confidence score of the transcript for which expression is detected. For example, s(t1,t2) = −1 + min(1,1) + min(2,2) + min(3,2) = −1 + 1 + 2 + 2 = 4.

Formally,

$s(t_i,t_j)=\sum_{t \in \mathrm{tissues}} S(t_i,t_j,t)$

where $S(t_i,t_j,t)=\begin{cases} \min(confidence(t_i,t),confidence(t_j,t)), & \mbox{if } confidence(t_i,t)>0 \mbox{ and } confidence(t_j,t)>0 \\ 0, & \mbox{if } confidence(t_i,t)=0 \mbox{ and } confidence(t_j,t)=0 \\ -confidence(t_i,t), & \mbox{if } confidence(t_i,t)>0 \mbox{ and } confidence(t_j,t)=0 \\ -confidence(t_j,t), & \mbox{if } confidence(t_j,t)>0 \mbox{ and } confidence(t_i,t)=0 \end{cases}$

This formulation yields s(t1,t3) = 0 − 1 + min(2,2) + min(3,2) = −1 + 2 + 2 = 3. Since s(t1,t2) = 4 > s(t1,t3) = 3, the score suggests that t1 and t2 have closer expression profiles than t1 and t3.
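The scoring scheme can be sketched in Python as follows. The function names `tissue_term` and `similarity` and the dictionary `confidence` are hypothetical, chosen for this example. The sketch assumes that confidence scores are non-negative integers, as in Table 2.

```python
# Confidence scores from Table 2, in tissue order:
# hypothalamus, hippocampus, brain, liver
confidence = {
    "t1": (0, 1, 2, 3),
    "t2": (1, 1, 2, 2),
    "t3": (0, 0, 2, 2),
}


def tissue_term(ci, cj):
    """Contribution S of one tissue, given the two confidence scores.

    Assumes ci >= 0 and cj >= 0.
    """
    if ci > 0 and cj > 0:
        # Expression detected in both transcripts: reward with the weaker score.
        return min(ci, cj)
    if ci == 0 and cj == 0:
        # Expression detected in neither transcript: no contribution.
        return 0
    # Expression detected in exactly one transcript: penalize with its score.
    # One of the two scores is zero here, so max() picks the non-zero one.
    return -max(ci, cj)


def similarity(a, b):
    """Similarity score s: sum of the per-tissue contributions."""
    return sum(tissue_term(ci, cj) for ci, cj in zip(a, b))


s12 = similarity(confidence["t1"], confidence["t2"])
s13 = similarity(confidence["t1"], confidence["t3"])

print("s(t1,t2) =", s12)  # prints s(t1,t2) = 4
print("s(t1,t3) =", s13)  # prints s(t1,t3) = 3
```

The score is symmetric in its two arguments, so `similarity(a, b)` and `similarity(b, a)` return the same value.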