
[ description ]

Textractor is a software framework designed to facilitate the development of tools that process text to extract information. It focuses on applications that need to process large quantities of text, for instance large collections of full-text articles (>10,000 articles). The development of Textractor started at the Institute for Computational Biomedicine, Weill Cornell Medical College. Since we distribute the source code under the GPL, you are welcome to reuse or extend the framework in any way you like.


[ about the method ]

The software architecture of the framework is unpublished and should be cited as: Textractor, http://icb.med.cornell.edu/crt/textractor/ (L. Shi and F. Campagne 2004).

[ documentation ]

Supplementary material for the gene/gene product extraction article:

Precision measurements on JBC2000 (Excel)
NCB14 Evaluation results (Excel)
PMC15 Evaluation results (Excel)
The catalog of protein name references
The protein dictionary

The following names were collected by regular expressions from the last quarter of JBC1999 for SVM training:
Protein names
Cell names
Interaction names
Biological process names

You can download the protein name lookup program here:
The lookup program (JAR file, version 1.1)

To use the lookup program, you will need Java 1.4+. Download the JAR file (tlookup.jar) and type java -jar tlookup.jar for usage information.


You can also download the source code, but you will need a full-fledged software development environment (JDK 1.4+, Ant 1.6+) and a suitable JDO implementation (we developed with FastObjects and have not tested Textractor with any other JDO implementation). By downloading this distribution, you agree to the terms of the GNU General Public License.

The Textractor source code v1.1 (GPL)
See also the contents of the Textractor CVS repository.

The latest development snapshot of the source code, archived on August 2nd, 2006, is also available for download. This version requires JDK 1.5+ and a suitable JDO implementation.

The Textractor source code v20060802144806 (GPL)

A precompiled version, archived on August 2nd, 2006, is also available for download. Using this version requires Apache Ant version 1.6.5 and access to an Oracle database.

Textractor v20060802144806 precompiled for use with Oracle.

Textractor API Documentation

Textractor API

Textractor is used in the following projects:

Twease

If you find this software useful, please let us know in a quick email.


