# BPrefTrecEval2006

We have been working on a Java implementation of the Python script "trecgen2006_score.py" (for scoring TREC submissions), written in such a way as to simplify adding new scoring algorithms. One of the scoring algorithms we wanted to add was Bpref. We read the paper describing Bpref by Chris Buckley, "Retrieval Evaluation with Incomplete Information", but the algorithms we designed based on reading the paper never matched the values from the official C program trec_eval. After inspecting the C program, we found some differences between how the Bpref algorithm is described in the paper and how it is implemented in trec_eval.

Two files are needed to compute Bpref. The first is a qrels file, which details, per topic, the pmids that have been judged relevant and the pmids that have been judged non-relevant. A sample qrels file might look like

160	0	A	1
160	0	B	0
160	0	C	1
160	0	D	1
160	0	E	1
160	0	F	0
160	0	H	1
160	0	I	1
160	0	J	1
160	0	K	1
160	0	L	0
160	0	M	0
160	0	N	1
160	0	P	0
160	0	Q	0
160	0	R	1
160	0	S	0
160	0	T	0
160	0	U	0
160	0	W	1
160	0	X	0
160	0	Z	1


where the columns are topic ID, <unknown>, pmid, and relevance. If relevance is 0, that pmid has been judged non-relevant for the topic. If relevance is non-0, it has been judged relevant.
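The qrels format above can be read into per-topic judgement sets with a few lines of code. This is a minimal sketch, not the trec_eval parser itself; the function name `parse_qrels` and the dict-of-sets layout are illustrative choices:

```python
def parse_qrels(lines):
    """Map topic ID -> (set of relevant pmids, set of non-relevant pmids)."""
    judgements = {}
    for line in lines:
        topic, _, pmid, relevance = line.split()
        rel, nonrel = judgements.setdefault(topic, (set(), set()))
        # A non-zero relevance column means "judged relevant" for this topic.
        (rel if int(relevance) != 0 else nonrel).add(pmid)
    return judgements

sample = ["160\t0\tA\t1", "160\t0\tB\t0", "160\t0\tC\t1"]
rel, nonrel = parse_qrels(sample)["160"]
# rel is {"A", "C"}; nonrel is {"B"}
```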

The second file that is necessary is a submission file. A sample submission file might look like

160	A	0	1.0000	0	0	R
160	B	1	1.0000	0	0	I
160	C	2	1.0000	0	0	R
160	D	3	1.0000	0	0	R
160	E	4	1.0000	0	0	R
160	F	5	1.0000	0	0	I
160	G	6	1.0000	0	0	-
160	H	7	1.0000	0	0	R
160	I	8	0.9800	0	0	R
160	J	9	0.9722	0	0	R
160	K	10	0.93	0	0	R
160	L	11	0.93	0	0	I
160	M	12	0.93	0	0	I
160	N	13	0.93	0	0	R
160	O	14	0.93	0	0	-
160	P	15	0.93	0	0	I
160	Q	16	0.93	0	0	I
160	R	17	0.93	0	0	R
160	S	18	0.92	0	0	I
160	T	19	0.92	0	0	I
160	U	20	0.91	0	0	I
160	V	21	0.89	0	0	-
160	W	22	0.88	0	0	R
160	X	23	0.88	0	0	I
160	Y	24	0.83	0	0	-
160	Z	25	0.83	0	0	R


where the columns are topic ID, pmid, rank, score, start position, length, and tag. In this sample I have marked the tag "R" if the line (based on the qrels file) is judged relevant, "I" if the line is judged non-relevant, and "-" if the pmid is unjudged.
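Deriving those tags from the qrels judgements can be sketched as follows. The function `tag_submission` and the two-set interface are our own illustrative choices; the sample lines here omit the tag column since that is what we are computing:

```python
def tag_submission(lines, relevant, nonrelevant):
    """Yield (pmid, rank, tag) where tag is "R", "I", or "-" per the qrels sets."""
    for line in lines:
        topic, pmid, rank, score, _, _ = line.split()[:6]
        if pmid in relevant:
            tag = "R"        # judged relevant
        elif pmid in nonrelevant:
            tag = "I"        # judged non-relevant
        else:
            tag = "-"        # pmid was never judged
        yield pmid, int(rank), tag

lines = ["160\tA\t0\t1.0000\t0\t0", "160\tB\t1\t1.0000\t0\t0", "160\tG\t6\t1.0000\t0\t0"]
tags = [t for _, _, t in tag_submission(lines, {"A"}, {"B"})]
# tags is ["R", "I", "-"]
```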

In the Bpref algorithm, the variable R is defined as "the number of relevant judgements" (per topic). This value comes from the qrels file, not the submission file. For this example, looking at the qrels file we see that R = 12. We will also take note of the variable N, which we will define as "the number of non-relevant judgements" (per topic). Looking again at the qrels file (not the submission file) we see that N = 10.
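The counts of R and N for the sample can be checked directly against the qrels file. Only the relevance column matters here; the pmid/relevance pairs below are a transcription of the sample qrels above:

```python
# (pmid, relevance) pairs for topic 160, copied from the sample qrels file.
qrels_lines = [
    ("A", 1), ("B", 0), ("C", 1), ("D", 1), ("E", 1), ("F", 0),
    ("H", 1), ("I", 1), ("J", 1), ("K", 1), ("L", 0), ("M", 0),
    ("N", 1), ("P", 0), ("Q", 0), ("R", 1), ("S", 0), ("T", 0),
    ("U", 0), ("W", 1), ("X", 0), ("Z", 1),
]
# R = number of relevant judgements; N = number of non-relevant judgements.
R = sum(1 for _, rel in qrels_lines if rel != 0)
N = sum(1 for _, rel in qrels_lines if rel == 0)
# R == 12, N == 10
```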

Buckley defines bpref as follows

$bpref = \frac{1}{R} \sum_r{\left(1 - \frac{number\ of\ n\ above\ r}{R}\right)}$

Here we are stepping through every line that was judged relevant (shown with a tag of "R" in the sample) in the submission file by order of decreasing score / increasing rank, r representing the current, relevant line of the submission file. The "number of n above r" is the number of submission entries judged non-relevant (shown with a tag of "I" in the sample) above the current (judged relevant) entry.
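A literal transcription of the paper's formula can be sketched like this, assuming the submission has already been reduced to its ordered tag string (rank order, as in the sample above); `bpref_paper` is our name for it, and unjudged ("-") entries are skipped entirely:

```python
def bpref_paper(tags, R):
    """Bpref exactly as the paper's formula reads: (1/R) * sum(1 - n_above/R)."""
    n_above = 0          # judged non-relevant entries seen so far
    total = 0.0
    for tag in tags:
        if tag == "I":
            n_above += 1
        elif tag == "R":
            total += 1.0 - n_above / R
    return total / R

# Tag string for the sample submission, in rank order (A through Z).
tags = "RIRRRI-RRRRIIR-IIRIII-RI-R"
value = bpref_paper(tags, R=12)
```

On the sample this literal version yields approximately 0.7222, which is not what trec_eval reports for the same input.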

We found that strictly implementing this formula did not produce values that matched what trec_eval produced. After looking at the code, we discovered that the formula actually implemented is

$bpref = \frac{1}{R} \sum_r{\left(1 - \frac{min(number\ of\ n\ above\ r, R)}{min(N, R)}\right)}$

So the actual sum for our sample submission file is

$\begin{matrix} R = 12 \\ N = 10 \end{matrix}$
$bpref = \frac{1}{12} \left[ \left(1 - \frac{0}{10}\right) + \left(1 - \frac{1}{10}\right) + \left(1 - \frac{1}{10}\right) + \left(1 - \frac{1}{10}\right) + \left(1 - \frac{2}{10}\right) + \left(1 - \frac{2}{10}\right) + \left(1 - \frac{2}{10}\right) + \left(1 - \frac{2}{10}\right) + \left(1 - \frac{4}{10}\right) + \left(1 - \frac{6}{10}\right) + \left(1 - \frac{9}{10}\right) + \left(1 - \frac{10}{10}\right) \right]$
$\begin{matrix} bpref = 0.66666667 \end{matrix}$
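The corrected formula can be sketched in code as follows. As before, `bpref_trec_eval` is our own name and the tag string encodes the sample submission in rank order; this sketch assumes at least one judged non-relevant document so the denominator is non-zero:

```python
def bpref_trec_eval(tags, R, N):
    """Bpref as trec_eval computes it: (1/R) * sum(1 - min(n_above, R)/min(N, R))."""
    denom = min(N, R)    # assumes denom > 0, i.e. N >= 1 and R >= 1
    n_above = 0          # judged non-relevant entries seen so far
    total = 0.0
    for tag in tags:
        if tag == "I":
            n_above += 1
        elif tag == "R":
            total += 1.0 - min(n_above, R) / denom
    return total / R

# Tag string for the sample submission, in rank order (A through Z).
tags = "RIRRRI-RRRRIIR-IIRIII-RI-R"
value = bpref_trec_eval(tags, R=12, N=10)
```

On the sample this reproduces the 0.66666667 worked out above.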