Morphological parser

From Icbwiki

A morphological parser can be used to parse a sequence of characters (representing a word in a language) and associate with each character of the sequence a tag that indicates if the character belongs to a prefix, stem or suffix of the word.

For instance, when a morphological parser is fed the word "polyubiquitination", it may tag the characters of the word as:

polyubiquitination
pppppppttttttttsss

This indicates that the parser recognized polyubi as a prefix, quitinat as a stem, and ion as a suffix of the word presented to it (p marks a prefix character, t a stem character, and s a suffix character). A morphological parser is related to stemming algorithms, but can be custom-built for a specific corpus dictionary.
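
The tagging scheme can be illustrated with a minimal sketch. It assumes the prefix/stem/suffix split is already known; the function name tag_word is hypothetical and is not part of textractor.

```python
def tag_word(prefix, stem, suffix):
    """Return the word and a tag string of the same length.

    'p' marks prefix characters, 't' stem characters and 's' suffix
    characters, following the letters used in the example above.
    """
    word = prefix + stem + suffix
    tags = "p" * len(prefix) + "t" * len(stem) + "s" * len(suffix)
    return word, tags


# Reproduce the example split: polyubi / quitinat / ion
word, tags = tag_word("polyubi", "quitinat", "ion")
print(word)
print(tags)
```

The two printed lines align character by character, so each position in the tag string labels the character directly above it.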

This page describes a protocol for creating a morphological parser from a term dictionary.

  1. Check out textractor from Subversion.
  2. Run ant -f build/pubmed.xml biostemmer-phase1. This scans the term dictionary to extract potential stems. A potential stem is a part of a term that (a) is a maximal common substring of the query term and one of the 60 terms most similar to it in the dictionary, and (b) occurs with maximal frequency among those 60 most similar terms.
    1. Details: the command java -Xmx1300m textractor.tools.biostems.StemTermDictionary -ml 2 -i D:\dev\medline-27\pubmed-index-text-dym-word.terms -o stems-biostemmer.txt -basename D:\dev\medline-27\pubmed-index stems with biostem and keeps stems longer than 2 characters (e.g., it keeps kin from kinase, but not at). This step can be run in parallel if the term dictionary is split into parts. Use the -1 option to scan the corpus first for each split. Then cat the results from each split and run the -2 option to tally stems.
  3. Tally the list of potential stems by frequency of occurrence in the dictionary. Inspect and clean up the list to remove the obvious prefixes/suffixes with the highest frequency. Remove the lowest-frequency stem candidates (e.g., candidate stems: interm, react).
  4. Scan the term dictionary again to collect prefix/stem/suffix possibilities for each term, considering each stem collected in the previous step. For instance, considering the following candidate stems (ordered by dictionary frequency):
47      interm
33      intermedy
31      intermit
11      intermod
8       intermedi

This produces the following (fragment shown for intermedi only):

bathointermediate	batho_ate
cd3intermediate	cd3_ate
disintermediation	dis_ation
dorsointermediate	dorso_ate
dorsointermedius	dorso_us
heterointermediate	hetero_ate
intermedia	_a
intermediacy	_acy
intermediae	_ae
intermedial	_al
intermediale	_ale
intermedialis	_alis
intermediaries	_aries
intermediary	_ary
intermediate	_ate
intermediated	_ated
intermediately	_ately
intermediates	_ates
intermediating	_ating
intermediation	_ation
intermediator	_ator
intermediators	_ators
intermediatory	_atory
intermediatry	_atry
intermedical	_cal
...
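
The scan that yields lines such as the ones above can be sketched as follows. This is an illustration only, not the textractor implementation: the function name split_on_stem is hypothetical, and the sketch assumes that only the first occurrence of the stem in a term is considered.

```python
def split_on_stem(term, stem):
    """Return 'prefix_suffix' for the first occurrence of stem in term.

    Returns None when the term does not contain the stem.
    """
    start = term.find(stem)
    if start < 0:
        return None
    prefix = term[:start]
    suffix = term[start + len(stem):]
    return prefix + "_" + suffix


# A few terms from the fragment above, plus one that lacks the stem.
terms = ["bathointermediate", "disintermediation", "intermedia", "kinase"]
for term in terms:
    pattern = split_on_stem(term, "intermedi")
    if pattern is not None:
        print(term + "\t" + pattern)
```

In this sketch the underscore stands for the stem, so the text before it is the prefix candidate and the text after it is the suffix candidate.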
  • Create a Finite State Automaton (FSA) from this data, such that the FSA encodes the character sequence and the tag.
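
One simple way to encode both the characters and their tags is a trie whose transitions are labelled with (character, tag) pairs, which is an acyclic, unminimized finite state automaton. The page does not say how the FSA is actually built, so the sketch below is an assumption; the class name TaggedTrie and its methods are hypothetical.

```python
class TaggedTrie:
    """Acyclic automaton whose transitions carry (character, tag) pairs."""

    FINAL = None  # key that marks an accepting state (an assumption of this sketch)

    def __init__(self):
        self.root = {}

    def add(self, word, tags):
        """Insert a word together with its per-character tag string."""
        if len(word) != len(tags):
            raise ValueError("word and tags must have the same length")
        state = self.root
        for pair in zip(word, tags):
            state = state.setdefault(pair, {})
        state[self.FINAL] = True

    def parses(self, word):
        """Return every tag string the automaton accepts for word."""
        results = []
        stack = [(self.root, 0, "")]
        while stack:
            state, i, tags = stack.pop()
            if i == len(word):
                if self.FINAL in state:
                    results.append(tags)
                continue
            for key, next_state in state.items():
                if key is self.FINAL:
                    continue
                char, tag = key
                if char == word[i]:
                    stack.append((next_state, i + 1, tags + tag))
        return sorted(results)


fsa = TaggedTrie()
fsa.add("intermediate", "t" * 9 + "s" * 3)             # intermedi + ate
fsa.add("disintermediation", "ppp" + "t" * 9 + "s" * 5)  # dis + intermedi + ation
print(fsa.parses("intermediate"))
print(fsa.parses("unknown"))
```

A word can be stored with more than one tag string, in which case parses returns all of them. A production automaton would typically also be minimized so that shared suffixes are stored once.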