PValue Calculation FastaTwease

We want to calculate P-values from the raw-score obtained by BM25 scoring sequences with Twease. In order to do this, we first study score samples obtained when querying a shuffled sequence database. The shuffled sequence database represents the null model that we will use to derive the P-value for observed scores.

Shuffled database preparation

We shuffle residues inside of sequences and index the resulting document. This shuffling method conserves the sequence length distribution observed in the real database, and the residue composition of each sequence in the database.

Since we need to estimate a distribution from sampling, we create shuffled databases larger than the original sequence database. We do this by indexing exact copies of the same real database in sequence. Since we use the same random number generator for each copy, and the seed is never reset, we effectively shuffle each database copy differently.

Storing raw scores for the whole database

In contrast to a usual information retrieval experiment, we are not just interested in the top results. We need to collect scores for all documents/sequences in the database. I changed evaluation.Evaluate to serialize an array of double[] scores when a fasta file is used as input. The scored are saved in files called ENSP*-scores.ser, where ENSP* is the accession code for the query sequence obtained from the fasta file. The class evaluation.ProcessScores reads such serialized files and output some statistics.

Score histogram in shuffled database vs real database

We search the databases with query sequence ENSP00000343957. Compare the distribution of scores in a shuffled database (Figure 1 ) vs in a real database (Figure 2). It is clearly apparent that the high scores obtained in the real database cannot be explained by the distribution seen when searching the shuffled database. Fabien Campagne 15:10, 31 October 2006 (EST)

Fitting the score distribution

Is the distribution of scores in the shuffled database normal, exponential, or something else? To answer this question empirically, we bin scores by equal ranges (i.e., putting in a bin all the matches that fall within the fixed-length score range). For instance, we can bin the sequence hits into 20 bins, as shown below:

score range in bin    tally of hits in bin
01: [109.77,112.06)	17526.00000000000000000
02: [112.06,114.36)	12661.00000000000000000
03: [114.36,116.66)	11029.00000000000000000
04: [116.66,118.95)	8882.00000000000000000
05: [118.95,121.25)	5697.00000000000000000
06: [121.25,123.55)	5161.00000000000000000
07: [123.55,125.85)	3114.00000000000000000
08: [125.85,128.14)	2644.00000000000000000
09: [128.14,130.44)	2102.00000000000000000
10: [130.44,132.74)	1539.00000000000000000
11: [132.74,135.03)	744.00000000000000000
12: [135.03,137.33)	763.00000000000000000
13: [137.33,139.63)	518.00000000000000000
14: [139.63,141.93)	250.00000000000000000
15: [141.93,144.22)	180.00000000000000000
16: [144.22,146.52)	87.00000000000000000
17: [146.52,148.82)	89.00000000000000000
20: [153.41,155.71]	92.00000000000000000

Armed with these tallies, we can now compare the distribution with well known distributions. We could use the Chi-Square goodness of fit test, or the One-sample Kolmogorov-Smirnov Test.

With these frequencies, S-Plus produces the following test report:

One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = poisson
data:  normalizedScore.count in frequencies
ks = 0.2778, p-value = 0.0475
alternative hypothesis: True cdf is less than the poisson distn. with the specified parameters

This result means that the test cannot refute that the sample is generated by a poisson distribution (at the 1% significance level). However, at the 5% level, the test refutes the distribution.

Now testing the exponential distribution on another sequence and with the 100x shuffling db:

One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = exponential

data:  normalizedScore.count in frequencies
ks = 0.2878, p-value = 0.1146
alternative hypothesis: True cdf is not the exponential distn. with the specified parameters

This suggests that the score distribution in a random database follows an exponential distribution. The result need to be checked with scores obtained by querying with several proteins against the large 100x shuffled database.

Exponential distribution fit

As suggested by the previous tests, we can approximate the score distribution in a random database with an [exponential distribution]. An exponential distribution has one parameter, λ, which can be estimated from a distribution sample using the relation λ = 1 / mean.

The cumulative distribution function (cdf) of an exponential distribution is given by: $F(x;\lambda) = \left\{\begin{matrix} 1-e^{-\lambda x}&,\; x \ge 0, \\ 0 &,\; x < 0. \end{matrix}\right.$

When the cdf F(x;λ) of the distribution is known, the P-value of observing a positive score S in a sample generated from the distribution is

P(S;λ) = 1 − F(S;λ) = e − λS

The exponential distribution is just an approximation to the distribution observed in the shuffled database, therefore these equations do not give an exact P-value. Instead, they can be used to obtain an approximation of the P-value.

We can use the Kolmogorov-Smirnov Test to determine the direction of the approximation.

One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = exponential

data:  normalizedScore.count in frequencies
ks = 0.2878, p-value = 0.0573
alternative hypothesis: True cdf is greater than the exponential distn. with the specified parameters
One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = exponential

data:  normalizedScore.count in frequencies
ks = 0.0946, p-value = 0.7087
alternative hypothesis: True cdf is less than the exponential distn. with the specified parameters

These results suggest that cdf exponential < cdf sample. This translates into P-value exponential > P-value sample and means that the P-value obtained by assuming an exponential distribution of scores in a shuffled database will yield a conservative P-value (the P-value reported will always be higher than the "true" P-value that would be calculated if the shuffled score distribution was known).