PValue Calculation FastaTwease

From Icbwiki

Jump to: navigation, search

We want to calculate P-values from the raw-score obtained by BM25 scoring sequences with Twease. In order to do this, we first study score samples obtained when querying a shuffled sequence database. The shuffled sequence database represents the null model that we will use to derive the P-value for observed scores.

Contents

Shuffled database preparation

We shuffle residues inside of sequences and index the resulting document. This shuffling method conserves the sequence length distribution observed in the real database, and the residue composition of each sequence in the database.

Since we need to estimate a distribution from sampling, we create shuffled databases larger than the original sequence database. We do this by indexing exact copies of the same real database in sequence. Since we use the same random number generator for each copy, and the seed is never reset, we effectively shuffle each database copy differently.

Storing raw scores for the whole database

In contrast to a usual information retrieval experiment, we are not just interested in the top results. We need to collect scores for all documents/sequences in the database. I changed evaluation.Evaluate to serialize an array of double[] scores when a fasta file is used as input. The scored are saved in files called ENSP*-scores.ser, where ENSP* is the accession code for the query sequence obtained from the fasta file. The class evaluation.ProcessScores reads such serialized files and output some statistics.

Score histogram in shuffled database vs real database

Figure 1. Distribution of scores for a real protein against the human protein db shuffled 1x.
Enlarge
Figure 1. Distribution of scores for a real protein against the human protein db shuffled 1x.
Figure 2. Distribution of scores for a real protein against the human protein db.
Enlarge
Figure 2. Distribution of scores for a real protein against the human protein db.


We search the databases with query sequence ENSP00000343957. Compare the distribution of scores in a shuffled database (Figure 1 ) vs in a real database (Figure 2). It is clearly apparent that the high scores obtained in the real database cannot be explained by the distribution seen when searching the shuffled database. Fabien Campagne 15:10, 31 October 2006 (EST)

Fitting the score distribution

Is the distribution of scores in the shuffled database normal, exponential, or something else? To answer this question empirically, we bin scores by equal ranges (i.e., putting in a bin all the matches that fall within the fixed-length score range). For instance, we can bin the sequence hits into 20 bins, as shown below:

 score range in bin    tally of hits in bin 
 01: [109.77,112.06)	17526.00000000000000000
 02: [112.06,114.36)	12661.00000000000000000
 03: [114.36,116.66)	11029.00000000000000000
 04: [116.66,118.95)	8882.00000000000000000
 05: [118.95,121.25)	5697.00000000000000000
 06: [121.25,123.55)	5161.00000000000000000
 07: [123.55,125.85)	3114.00000000000000000
 08: [125.85,128.14)	2644.00000000000000000
 09: [128.14,130.44)	2102.00000000000000000
 10: [130.44,132.74)	1539.00000000000000000
 11: [132.74,135.03)	744.00000000000000000
 12: [135.03,137.33)	763.00000000000000000
 13: [137.33,139.63)	518.00000000000000000
 14: [139.63,141.93)	250.00000000000000000
 15: [141.93,144.22)	180.00000000000000000
 16: [144.22,146.52)	87.00000000000000000
 17: [146.52,148.82)	89.00000000000000000
 20: [153.41,155.71]	92.00000000000000000

Armed with these tallies, we can now compare the distribution with well known distributions. We could use the Chi-Square goodness of fit test, or the One-sample Kolmogorov-Smirnov Test.

With these frequencies, S-Plus produces the following test report:

 	One-sample Kolmogorov-Smirnov Test
 	Hypothesized distribution = poisson
 data:  normalizedScore.count in frequencies 
 ks = 0.2778, p-value = 0.0475 
 alternative hypothesis: True cdf is less than the poisson distn. with the specified parameters 

This result means that the test cannot refute that the sample is generated by a poisson distribution (at the 1% significance level). However, at the 5% level, the test refutes the distribution.

Now testing the exponential distribution on another sequence and with the 100x shuffling db:

 	One-sample Kolmogorov-Smirnov Test
 	Hypothesized distribution = exponential
 
 data:  normalizedScore.count in frequencies 
 ks = 0.2878, p-value = 0.1146 
 alternative hypothesis: True cdf is not the exponential distn. with the specified parameters

This suggests that the score distribution in a random database follows an exponential distribution. The result need to be checked with scores obtained by querying with several proteins against the large 100x shuffled database.

Exponential distribution fit

As suggested by the previous tests, we can approximate the score distribution in a random database with an [exponential distribution]. An exponential distribution has one parameter, λ, which can be estimated from a distribution sample using the relation λ = 1 / mean.

The cumulative distribution function (cdf) of an exponential distribution is given by:

F(x;\lambda) = \left\{\begin{matrix} 1-e^{-\lambda x}&,\; x \ge 0, \\ 0 &,\; x < 0. \end{matrix}\right.

When the cdf F(x;λ) of the distribution is known, the P-value of observing a positive score S in a sample generated from the distribution is

P(S;λ) = 1 − F(S;λ) = e − λS

The exponential distribution is just an approximation to the distribution observed in the shuffled database, therefore these equations do not give an exact P-value. Instead, they can be used to obtain an approximation of the P-value.

We can use the Kolmogorov-Smirnov Test to determine the direction of the approximation.

 	One-sample Kolmogorov-Smirnov Test
 	Hypothesized distribution = exponential
   
 data:  normalizedScore.count in frequencies 
 ks = 0.2878, p-value = 0.0573 
 alternative hypothesis: True cdf is greater than the exponential distn. with the specified parameters 
 	One-sample Kolmogorov-Smirnov Test
 	Hypothesized distribution = exponential
 
 data:  normalizedScore.count in frequencies 
 ks = 0.0946, p-value = 0.7087 
 alternative hypothesis: True cdf is less than the exponential distn. with the specified parameters 

These results suggest that cdf exponential < cdf sample. This translates into P-value exponential > P-value sample and means that the P-value obtained by assuming an exponential distribution of scores in a shuffled database will yield a conservative P-value (the P-value reported will always be higher than the "true" P-value that would be calculated if the shuffled score distribution was known).

Personal tools