Evaluating Morphological Word Variant Discovery

This page describes the evaluation of the morphological variants discovered automatically by the process implemented in Twease.

Evaluation Strategy

We will evaluate word variants by their impact on information retrieval. We will measure retrieval performance on the TREC Genomics experiments (2004 corpus; 2004 and 2005 queries) under the following conditions:

  • raw queries
  • queries expanded with word variants at different thresholds (e.g., Twease slider values)
  • corpus and queries stemmed with the Porter and PaiceHusk stemmers
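
The comparison across these conditions can be sketched as follows. This page does not say which retrieval metric is used, so the sketch assumes mean average precision (MAP), a common choice for TREC-style evaluations. The function names and the data layout (a run as a ranked list of document ids per query, relevance judgments as a set of ids per query) are illustrative assumptions, not part of Twease or textractor.

```python
from statistics import mean


def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against the relevant ids.

    Assumes each document id appears at most once in ranked_ids.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """run: {query_id: [doc_id, ...]}; qrels: {query_id: {doc_id, ...}}.

    Queries missing from the run score 0.
    """
    return mean(average_precision(run.get(q, []), rel) for q, rel in qrels.items())


def compare_conditions(runs_by_condition, qrels):
    """Return {condition_name: MAP} for each condition (hypothetical names)."""
    return {
        name: mean_average_precision(run, qrels)
        for name, run in runs_by_condition.items()
    }
```

With one run per condition (raw, expanded at each threshold, Porter, PaiceHusk), `compare_conditions` yields one score per condition, which makes the effect of query expansion directly comparable to that of stemming.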

Corpus preparation

The corpus consists of about 4 million Medline abstracts distributed by the TREC Genomics track organizers. Three distinct indices are built from this corpus:

  • No stemming; tokenization and term processing as implemented in TweaseWordReader
  • Stemming with the PaiceHusk algorithm
  • Stemming with the Porter algorithm

These text collections are indexed with textractor. For this purpose, the project is checked out on zeppo under /dat/scratch/dev/TREC/2004-corpus/textractor. Ant build targets are prepared in build/pubmed.xml to automate corpus preparation (see target trec-2005-abstract).

The indices are stored under:

  • /homesK/campagne/projects/TREC/stemmer/stem-indices/trec2004-biostemmer
  • /homesK/campagne/projects/TREC/stemmer/stem-indices/trec2004-paice-husk
  • /homesK/campagne/projects/TREC/stemmer/stem-indices/trec2004-porter


Results are stored under


Of special interest are the results in /homesK/campagne/projects/TREC/stemmer/sliding-probabilities, which present runs in which the suggestRelated terms are linked to probabilities. These results were obtained with a version of Twease that created a separate term equivalence class for each source of slider related terms, so the runs need to be redone once that is fixed.
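
The difference between per-source equivalence classes and a single merged class can be illustrated with a small sketch. Everything here is an assumption made for illustration: the candidate tuples, the source labels ("source_a", "source_b"), the function names, and the choice to include the query term in each class are hypothetical and do not reflect the actual Twease data structures.

```python
from collections import defaultdict


def select_variants(candidates, threshold):
    """Keep (term, probability, source) candidates at or above the threshold."""
    return [c for c in candidates if c[1] >= threshold]


def classes_per_source(query_term, candidates, threshold):
    """One equivalence class per source (the behaviour described as needing a fix)."""
    by_source = defaultdict(set)
    for term, _prob, source in select_variants(candidates, threshold):
        by_source[source].add(term)
    # Sort by source label so the output order is deterministic.
    return [{query_term} | terms for _source, terms in sorted(by_source.items())]


def single_class(query_term, candidates, threshold):
    """All selected variants merged into one equivalence class."""
    terms = {term for term, _prob, _source in select_variants(candidates, threshold)}
    return {query_term} | terms


# Hypothetical candidates for the query term "kinase".
candidates = [
    ("kinases", 0.9, "source_a"),
    ("kinased", 0.4, "source_a"),
    ("phosphotransferase", 0.7, "source_b"),
]
per_source = classes_per_source("kinase", candidates, threshold=0.5)
merged = single_class("kinase", candidates, threshold=0.5)
```

In this toy example the threshold plays the role of the slider value: lowering it admits more variants into the class or classes used to expand the query.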

