TREC Genomics Track Procedure


This page describes how we generated the runs submitted to the 2006 TREC genomics track.

The corpus was prepared as a mixed sentence-level/full article-level index. This is achieved by indexing the positions of sentence boundaries and, when sentence-level searches are required, filtering out matches that span a boundary.
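A minimal sketch of this filtering step, assuming sentence boundaries are stored as sorted token positions (the class and method names below are invented for illustration and are not the textractor code):

 import java.util.Arrays;
 
 public class SentenceBoundaryFilter {
     // Sorted token positions at which sentences end.
     private final int[] boundaries;
 
     public SentenceBoundaryFilter(final int[] sortedBoundaries) {
         this.boundaries = sortedBoundaries;
     }
 
     // A match spanning token positions [start, end) is kept for a
     // sentence-level search only if no boundary falls inside it.
     public boolean withinOneSentence(final int start, final int end) {
         int i = Arrays.binarySearch(boundaries, start);
         if (i < 0) i = -i - 1; // index of the first boundary >= start
         return i >= boundaries.length || boundaries[i] >= end;
     }
 }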


All Runs.

All runs are performed with the Twease slider at position 200. At this position, the slider expands the query with all morphological word variants, abbreviations, and MeSH synonyms that match the query words. Morphological word variants are discovered at runtime with a statistical model trained on Medline 2006 (Campagne, F., unpublished, 2006).
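Schematically, the expansion at this slider position replaces each query word by the union of the word and its admissible variants. In the sketch below, the class and the three lookup methods are illustrative stand-ins, not the Twease API:

 import java.util.LinkedHashSet;
 import java.util.Set;
 
 public class SliderExpansion {
     public Set<String> expand(final String word) {
         final Set<String> variants = new LinkedHashSet<>();
         variants.add(word);
         variants.addAll(morphologicalVariants(word)); // statistical model trained on Medline 2006
         variants.addAll(abbreviations(word));
         variants.addAll(meshSynonyms(word));
         return variants;
     }
 
     // Stand-ins for the expansion sources described above:
     Set<String> morphologicalVariants(final String word) { return Set.of(); }
     Set<String> abbreviations(final String word) { return Set.of(); }
     Set<String> meshSynonyms(final String word) { return Set.of(); }
 }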

Passages are assigned as the minimal intervals where the query matches the document.
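Such minimal intervals can be computed with a single sweep over the per-term position lists, in the spirit of the interval semantics implemented in MG4J. The sketch below is illustrative only and is not the code used for the runs:

 import java.util.ArrayList;
 import java.util.List;
 
 public class MinimalIntervals {
     // positions[t] holds the sorted token positions of query term t in one
     // document. Returns every interval [left, right] that contains all terms
     // and contains no smaller interval that also does.
     public static List<int[]> find(final int[][] positions) {
         final List<int[]> result = new ArrayList<>();
         final int[] ptr = new int[positions.length];
         int[] pending = null;
         while (true) {
             int minTerm = -1, left = Integer.MAX_VALUE, right = Integer.MIN_VALUE;
             for (int t = 0; t < positions.length; t++) {
                 if (ptr[t] == positions[t].length) {          // a term is exhausted:
                     if (pending != null) result.add(pending); // flush the last candidate
                     return result;
                 }
                 final int p = positions[t][ptr[t]];
                 if (p < left) { left = p; minTerm = t; }
                 if (p > right) right = p;
             }
             // [left, right] covers all terms. A previous candidate with the same
             // right end is discarded, because the current one nests inside it.
             if (pending != null && pending[1] < right) result.add(pending);
             pending = new int[]{left, right};
             ptr[minTerm]++; // advance the leftmost term to look for a tighter interval
         }
     }
 }

For example, with term positions {1, 5} and {3}, the sweep yields the minimal intervals [1, 3] and [3, 5].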

Corpus preparation

A new corpus index is generated with TweaseWordReader2 and stored at harpo:/dat/scratch/campagne/trec-2006/October06/textractor/trec-2006/index.

Run 1.

Run 1 was performed with queries at the full article level only. Slider position 200. In this run, we used the MG4J Vigna scorer as baseline. The Vigna scorer favors matches where search terms appear in short text intervals.

Run 2.

Run 2 was performed with parts of the queries matched at the sentence level, where appropriate, with the remaining terms matching the rest of the article, and with ranking by context. Slider at position 200. Context ranking is a new ranking strategy implemented in our textractor framework for the 2006 TREC genomics track. Context queries are expressed as (query)/(context). Briefly, context ranking allows ranking the documents that match a query by a context, specified as a query expression (e.g., "colon cancer" as a phrase or keywords with boolean clauses). The words in the context do not necessarily occur in the documents being ranked. Instead, the documents that match the context part of the query are used to infer words that are associated with the context in the corpus. These inferred words are then used to rank the documents that match the query part.
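The sketch below illustrates the idea in the style of pseudo-relevance feedback: words that are frequent in the context-matching documents are used to score the query-matching documents. The class, method names, and scoring formula are assumptions made for illustration and do not reproduce the textractor implementation:

 import java.util.HashMap;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
 import java.util.stream.Collectors;
 
 public class ContextRanker {
     // Collect the k words most frequent in the context-matching documents.
     static Set<String> contextWords(final List<List<String>> contextDocs, final int k) {
         final Map<String, Integer> frequencies = new HashMap<>();
         for (final List<String> doc : contextDocs) {
             for (final String word : doc) {
                 frequencies.merge(word, 1, Integer::sum);
             }
         }
         return frequencies.entrySet().stream()
                 .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                 .limit(k)
                 .map(Map.Entry::getKey)
                 .collect(Collectors.toCollection(LinkedHashSet::new));
     }
 
     // Score a query-matching document by the fraction of its words that are
     // associated with the context.
     static double score(final List<String> doc, final Set<String> contextWords) {
         final long hits = doc.stream().filter(contextWords::contains).count();
         return doc.isEmpty() ? 0.0 : hits / (double) doc.size();
     }
 }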

Run 3.

Run 3 was performed with queries at the full article level, ranked by context as in Run 2. The contexts of the queries in Run 2 were added to the queries from Run 1 to form the queries for this run. Slider at position 200. For each topic, the queries have the form (query run 1) / (context run 2).

Preparing Runs.

  • Runs are generated with TrecEvaluate in evaluation.jar, in twease/lib. The package evaluation.jar is produced when compiling module TweaseEvaluation in the SVN repository.

An Ant target is available in build.xml for each run; the target for run 1 is called evaluate-trec-2006-run1.
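For example, to produce run 1:

 ant evaluate-trec-2006-run1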

  • Run locator:
 java -Xmx900m -jar locator.jar -unique 
 -batch trec-gen-twease-sliderValue-0-scorer-vigna-run-MANUAL-blind-expansion-false-expander-BlindQueryExpander4.txt 
 -basename /homesK/campagne/projects/TREC/stemmer/stem-indices/trec-2004-biostem/index
 -l /dat/scratch/mjw/trec/2006/ -writer eval-located-2006-1.txt 

See also locator run automations.

  • Sort by topic ID (field 1, numeric) and rank (field 3, numeric):
 sort -k 1n -k 3n eval-located-2006-3.txt-unique > eval-sorted-2006-3.txt-unique
  • Run RenumberTrecSubmission (Note: the following step no longer appears to be necessary; the "-unique" file generated previously seems to already be "renumbered".)
 java textractor.util.RenumberTrecSubmission  -i H:\projects\locator\Locator\eval-sorted-2006-2.txt-unique 
 -o H:\projects\locator\Locator\eval-reranked-2006-2.txt

The file eval-reranked-2006-2.txt is the final result.

Check the submission with the TREC Genomics run validation tool (see Validating Runs below).

Validating Runs.

Submission Format

Submissions to the TREC Genomics Track contain the following data elements, separated by whitespace (i.e., a tab character):

  • Topic ID - from 160 to 187.
  • Doc ID - name of the HTML file minus the .html extension. This is the PMID that has been designated by Highwire, even though we now know that this may not be the true PMID assigned by the NLM (i.e., used in MEDLINE). But this is the official identifier for the document.
  • Rank number - rank of the passage for the topic, starting with 1 for the top-ranked passage and proceeding down the ranking to a maximum of 1000.
  • Rank value - system-assigned score for the passage, an internal number that should decrease as the rank number increases (i.e., higher-ranked passages receive higher scores).
  • Passage start - the byte offset in the Doc ID file where the passage begins, where the first character of the file is offset 0.
  • Passage length - the length of the passage in bytes, in 8-bit ASCII, not Unicode.
  • Run tag - a tag assigned by the submitting group that should be distinct from all the group's other runs (and ideally any other group's runs, so it should probably have the group name, e.g., ICB).

There should be a maximum of 1000 results per Topic ID. Additionally, if there are no matches for a particular Topic ID (as is the case for topic 170 in the excerpts below), a "dummy" result should be added to the output. For this type of entry, the Doc ID and Passage Start should be set to "0" and the Passage Length should be set to "1".

Here are excerpts from the submission results for Run 1:

  160 10811947 1 0.999892 25625 25 icb1
  160 10811947 2 0.999892 2828 10 icb1
  160 15722547 3 0.9991344 4247 10 icb1
  160 15722547 4 0.9991344 47752 46 icb1
    ...
  169 10484474 1000 0.7305642 40924 79 icb1
  170 0 1 1 0 1 icb1
  171 11147797 1 0.012987013 347 286 icb1
  171 11147797 2 0.012987013 69919 428 icb1
    ...
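For illustration, one topic's results could be emitted in this format as sketched below, including the dummy-line rule for topics without matches. The Result holder and writeTopic method are hypothetical; the submitted runs were produced by TrecEvaluate, not by this code:

 import java.util.List;
 import java.util.Locale;
 
 public class SubmissionWriter {
     // Hypothetical holder for one retrieved passage.
     record Result(String docId, double score, int passageStart, int passageLength) {}
 
     static void writeTopic(final int topicId, final List<Result> results,
                            final String runTag, final StringBuilder out) {
         if (results.isEmpty()) {
             // No matches for this topic: emit the required "dummy" line.
             out.append(String.format(Locale.ROOT, "%d 0 1 1 0 1 %s%n", topicId, runTag));
             return;
         }
         int rank = 1;
         for (final Result r : results) {
             if (rank > 1000) break; // at most 1000 results per topic
             out.append(String.format(Locale.ROOT, "%d %s %d %s %d %d %s%n",
                     topicId, r.docId(), rank++, r.score(),
                     r.passageStart(), r.passageLength(), runTag));
         }
     }
 }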

Validation Script

A script that can be used to validate that submissions adhere to the proper format was downloaded from the Tracks Homepage. The script is written in Perl and is called check_genomics.pl. The script identifies problems with the submission such as missing fields, topic IDs out of range, invalid tag names, and missing results for one or more topics. Additionally, warnings are generated for any topic that contains fewer than 1000 entries. This does not necessarily mean there is a problem; more than 1000 entries for a topic, however, is flagged as an error.

The script is executed by passing the name of the file to check as an argument on the command line. For example, if the submission file is called "eval-reranked-2006-2.txt", the script is executed as follows:

  $ ./check_genomics.pl eval-reranked-2006-2.txt

Output of the validation script is placed into a file called "eval-reranked-2006-2.txt.errlog".

Validation Results
