Textractor


About Textractor

Textractor is a software framework developed in the laboratory of Fabien Campagne. The framework is designed to facilitate the development of software tools that process text to extract information. The focus of this framework is on applications that need to process large quantities of text, for instance large collections of full text articles (>>10,000 articles). The development of textractor started at the Institute for Computational Biomedicine, Weill Cornell Medical College. Since we distribute the source code under the GPL, you are welcome to reuse or extend the framework in any way you like.

Textractor includes the implementation of a method to rank morphological word variants. The method has objectives similar to stemming but infers word variations directly from a corpus without linguistic expertise. The method was presented in a poster at ISMB 2006. A copy of the poster is available here.

Getting Textractor

Download

Textractor can be downloaded from the Textractor Home Page.

Subversion Access

The project's Subversion repository can be checked out with the following command:

 svn co https://pbtech-vc.med.cornell.edu/public/svn/icb/trunk/textractor

Browse Textractor in the Subversion repository.

Using Textractor

Software Requirements

Configuration

Textractor expects a file called textractor.properties to be on the classpath, ideally in the "config" directory of the distribution.

Creating the Database

The settings textractor uses to connect to the backend database are located in the 'textractor.properties' file. The default configuration assumes a user called 'textractor' with a password of 'password' for a database running on the local machine. Edit the property file to match your own configuration. You must create the appropriate users and tablespaces/schemas before running textractor. Once the database and appropriate permissions are set up, the ant build/run scripts will typically take care of creating the tables that textractor uses. SQL scripts are provided for Oracle and MySQL and are described in the following sections.
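
As an illustration, the database section of textractor.properties might look like the sketch below, using the standard JDO connection property names. This is an assumption for illustration only; consult the sample property files shipped with the distribution for the names textractor actually reads:

 # Hypothetical sketch of database connection settings; the actual
 # property names may differ from these standard JDO options.
 javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
 javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/textractor
 javax.jdo.option.ConnectionUserName=textractor
 javax.jdo.option.ConnectionPassword=password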

Oracle

 sqlplus textractor@localhost @create-oracle.sql

MySQL

 mysql --user=textractor --pass textractor < create-mysql.sql
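
Before running these scripts, the database user and schema must already exist, as noted above. As a minimal sketch, assuming MySQL and the default credentials from textractor.properties:

 -- Minimal setup sketch; adjust the schema name and password to your site.
 CREATE DATABASE textractor;
 CREATE USER 'textractor'@'localhost' IDENTIFIED BY 'password';
 GRANT ALL PRIVILEGES ON textractor.* TO 'textractor'@'localhost';
 FLUSH PRIVILEGES;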

Sample datasets

Getting the data from PubMed

The textractor distribution does not include any data, but data can be downloaded from PubMed. We have provided an ant target to download a sample dataset we refer to as textractor-dataset-a. The dataset includes abstracts with examples of ambiguous terms. To download the data using the ant script, execute the following from the textractor build directory:

 ant -f binary.xml fetch-dataset-a

Loading and indexing the data

Once the data has been downloaded from PubMed, it can be loaded into the database by executing the following command from the textractor base directory:

 ant -f build/binary.xml boot-ambiguity

Developing with Textractor

Software Requirements

Configuration

JDO

The textractor project was originally designed to use a fastobjects JDO database. Although recent development efforts have shifted away from JDO in favor of mg4j, a large portion of the codebase and build scripts still require some JDO settings.

Sample configuration files for fastobjects and kodo have been provided. Additionally, there is a "nojdo" configuration that indicates textractor should not be enhanced for use with a JDO database. An ant script has been provided to ease database and logging configuration setup. The script is called "config.xml" and resides in the build directory of the textractor distribution. Execute the following command from the textractor base directory:

 ant -f build/config.xml type

where type is one of fastobjects, kodo-oracle, kodo-mysql or nojdo.

Log4j

Textractor is built using commons logging, and sample configurations for log4j are provided. A log4j.properties file will be placed into the textractor config directory when the ant configuration target described above is executed.
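
For illustration only (the shipped sample may differ), a basic log4j.properties could look like:

 # Illustrative log4j configuration, not the file shipped with textractor.
 log4j.rootLogger=INFO, console
 log4j.appender.console=org.apache.log4j.ConsoleAppender
 log4j.appender.console.layout=org.apache.log4j.PatternLayout
 log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
 # Raise the level for textractor classes when debugging.
 log4j.logger.textractor=DEBUG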

Building

Once textractor has been configured as described above, it can be compiled using the ant scripts in the textractor "build" directory. Running the following command from the textractor base directory will compile the appropriate java classes:

 ant -f build/build.xml compile

Testing

A number of tests have been written for textractor using the JUnit testing framework. Execute the following command to run the tests for textractor:

 ant -f build/build.xml test

Note that many tests will fail if textractor was configured without JDO support (i.e., "nojdo") or if the backend JDO database is not configured properly.

Preparing a release intended for public distribution

When preparing to release textractor to the public, it is very important to remember the following:

  • Do not release files and/or data that we are not allowed to redistribute (e.g., license keys, passwords, commercial libraries).
  • Releases should always be tagged in the subversion repository so that exactly what was released can be traced back to the repository.

There is an ant build file in the textractor source tree that handles the nitty-gritty details of packaging a release build for textractor. The ant build file is located in the build directory of the textractor source tree and is called "release.xml". The default target will create a new tag in subversion and package the source, documentation, tlookup.jar and locator.jar from the code represented by the new tag. Note that you will need write access to the subversion repository in order to create a tag for a new release. Before making a release, you'll need to have the following:

  • All the code to be released needs to be checked into subversion. The release procedure does not package from the local copy of textractor.
  • If you are performing a release from the main development trunk, you will need to decide on a name for the tag. The release procedure will suggest a default name of the form textractor-yyyyMMddHHmmss, but you may want to provide your own tag in certain cases. If you are repackaging a release from an existing tag, you need to know the name of the tag.
  • In order to package a pre-compiled version of textractor, you'll need license keys for kodo. You will need both a development and runtime license. The development license is used to build the release and the runtime license will be packaged for other users. It is very important that we keep the development license key private.

Once the code is ready to be released and you have all the information you need, execute the following from the "build" directory of a local copy of textractor:

 ant -DdevelopmentLicenseKey="XXXX" -DruntimeLicenseKey="YYYY" -f release.xml

The procedure will prompt you for the tag name to use if it is not provided on the command line. The files to be released are placed into a directory named "release-${tag}" in the root of the local textractor work directory. In the release directory, the following zip files are created:

  • ${tag}-src.zip
  • ${tag}-bin.zip
  • ${tag}-locator.zip
  • tlookup.jar

Implementation Notes

Two of the basic java objects textractor defines for processing text are textractor.datamodel.Article and textractor.datamodel.Sentence. An Article object contains metadata about the original source of the text held in some number of Sentence objects. Typically these java objects map directly to their "real world" equivalents such as PubMed entries, but they can also represent other types of data such as sequences read from FASTA files.
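
A short sketch of how these classes fit together is shown below. The constructor and setter signatures are hypothetical, included only to illustrate the Article/Sentence relationship; see the javadoc for the actual API:

 import textractor.datamodel.Article;
 import textractor.datamodel.Sentence;

 public class DatamodelSketch {
     public static void main(final String[] args) {
         // An Article holds metadata about the source document.
         final Article article = new Article();
         article.setPmid(15142430L);             // hypothetical setter
         article.setFilename("medline-sample");  // hypothetical setter

         // A Sentence holds a piece of text and points back to its Article.
         final Sentence sentence = new Sentence(article);  // hypothetical constructor
         sentence.setText("Textractor processes large text collections.");
     }
 }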

Multi-threaded loading and processing

Many operations performed by textractor involve loading data from files, urls, databases, etc. and passing the results on for further processing. In simple terms, the overall process can be thought of in terms of the classic "producer/consumer" model, where a file loader class would "produce" parsed results which could then be "consumed" by a class that indexed the content of the results. In some cases, the processing includes some transformation of the data in between the initial "production" and the final "consumption". A simple example of a transformer would be a class that capitalized each word generated by a text producer and passed the word on to the original consumer. Transformers can be viewed as both a producer and a consumer at the same time. Textractor supports this producer/transformer/consumer model through the textractor.sentence.SentenceProducer, textractor.sentence.SentenceConsumer and textractor.sentence.SentenceTransformer interfaces. All three of these share a common superinterface called textractor.sentence.SentenceProcessor, which provides some simple tracking statistics and notification of processing events.
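
The capitalization example might look roughly like the sketch below. The transform method and its signature are assumptions for illustration; the real interfaces define their own contracts:

 import textractor.datamodel.Sentence;
 import textractor.sentence.SentenceTransformer;

 // Sketch of a transformer that upper-cases each sentence before
 // passing it on. The method shown here is hypothetical.
 public class UpperCaseTransformer implements SentenceTransformer {
     public Sentence transform(final Sentence sentence) {
         sentence.setText(sentence.getText().toUpperCase());
         return sentence;
     }
 }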

Generally speaking, even though the content and the format of the data used in textractor may differ, the flow of processing is similar if not identical. Furthermore, the datasets can be quite large and in many cases processing can begin before all the data has been read. This type of execution flow lends itself to processing with multiple threads rather than serially. To support a threaded execution model in a generic fashion, textractor has adopted the "Chain of Responsibility" pattern by extending the Commons Chain API. The default behavior of the commons chain implementation is to execute each sub-command within a single chain in series. In textractor, chains (producers and transformers) start all sub-commands (consumers and transformers) in a new thread. The details of the data structures used to pass data between threads are kept hidden from implementing classes.
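
For reference, the underlying Commons Chain API looks like the following sketch. ChainBase executes its commands serially by default, which is the behavior textractor overrides to run consumers and transformers in their own threads:

 import org.apache.commons.chain.Command;
 import org.apache.commons.chain.Context;
 import org.apache.commons.chain.impl.ChainBase;
 import org.apache.commons.chain.impl.ContextBase;

 public class ChainSketch {
     public static void main(final String[] args) throws Exception {
         final ChainBase chain = new ChainBase();
         chain.addCommand(new Command() {
             public boolean execute(final Context context) {
                 context.put("text", "produced data");
                 return false;  // false: continue with the next command
             }
         });
         chain.addCommand(new Command() {
             public boolean execute(final Context context) {
                 System.out.println(context.get("text"));
                 return true;   // true: chain processing is complete
             }
         });
         chain.execute(new ContextBase());
     }
 }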

"Catalogs" used to load and index with mg4j can be found in config/catalogs directory of the source distribution. A ChainExecutor class that may be used to initiate a chain load/index is available. Many examples of how these are used can be found in various ant build scripts in the build directory of the source distribution.

Medline Index

The medline distribution files are mirrored locally in the icb home account at ~icb/db/pubmed. A cron job running nightly from the account "icbmirror" on euros.med.cornell.edu keeps the mirror up to date.

Weekly Updates

A new index of the current medline data files is generated weekly by the cruisecontrol project called "medline". The results of the medline index builds are stored in ~twease/medline/index using the name of the cruisecontrol build (e.g., "medline-56").

Miscellaneous

"No Title" Index

Indexes that exclude titles have been generated and are located in ~icb/textractor/medline-notitle. The indexes were generated using three different stemming methods: twease, paice-husk and porter.

A random sampling of titles from the medline set used for the index is contained in a file called random-medline-titles.tsv. Only those articles that contain both a title and an abstract were used to generate this set.

"No Title" Evaluation

Evaluation results are located in ~icb/textractor/medline-notitle/evaluation. The runs were generated using the three stemming methods. For the twease method, slider values of 0-200 inclusive in increments of 20 were used. At the present time, results are available for 1,000 (1k) titles only. Runs with 100 and 10,000 (10k) titles are in progress.

More Information
