# Clustering terms into independent classes

### From Icbwiki

We start with the set of unique terms *T* and want to create a mapping for each element .

Essentially, if we have a similarity measure between terms, we can cluster terms with this measure and obtain the mapping from terms to classes (a class is a cluster).

Useful similarity measures include those that use information about pairs of terms and how frequently the terms in the pair occur in the same document. Formally, they use information about the frequency of the term in the whole corpus (number of documents in which the term appears), frequency for the other term, and the count of documents that contain both terms and .

To cluster *n* terms into classes, we need to pre-calculate a matrix filled with the counts. Pre-calculating this matrix efficiently will be key since *n* can be large (e.g., *n* > 100).

The new MG4J ordered AND operator suggests a scalable approach.

Suppose we perform the query (*t*_{1} | *t*_{i} | ... | *t*_{n}) > (*t*_{1} | *t*_{i} | ... | *t*_{n}). This query will return documents that contain at least two non-overlapping terms. The same term may occur in the same document and be returned, but in that case it will occur at a different position in the document. We can navigate the iterator corresponding to this query. Each interval matching the query will contribute to one element of the count matrix.

Therefore, for each matching interval, determine:

- the identity of the two terms that matched (i.e., determine the index of the term in the count matrix , ).
- if the same terms matched the same document, then the matrix is unchanged.
- else

See also Latex primer [1].