Clustering terms into independent classes

We start with the set of unique terms T and want to create a mapping $c(t) \in C$ for each element $t \in T$.

Essentially, if we have a similarity measure between terms, we can cluster terms with this measure and obtain the mapping from terms to classes (a class is a cluster).

Useful similarity measures include those that use information about pairs of terms and how frequently the terms in the pair occur in the same document. Formally, they use information about $\emph f(t_i)$ the frequency of the term in the whole corpus (number of documents in which the term appears), $\emph f(t_j)$ frequency for the other term, and $f(t_i \cap t_j)$ the count of documents that contain both terms $\emph t_i$ and $\emph t_j$.

To cluster n terms into classes, we need to pre-calculate a $n \times n$ matrix filled with the $f(t_i \cap t_j)$ counts. Pre-calculating this matrix efficiently will be key since n can be large (e.g., n > 100).

The new MG4J ordered AND operator suggests a scalable approach.

Suppose we perform the query (t1 | ti | ... | tn) > (t1 | ti | ... | tn). This query will return documents that contain at least two non-overlapping terms. The same term may occur in the same document and be returned, but in that case it will occur at a different position in the document. We can navigate the iterator corresponding to this query. Each interval matching the query will contribute to one element of the count matrix.

Therefore, for each matching interval, determine:

• the identity of the two terms that matched (i.e., determine the index of the term in the count matrix $I_{t_i}$, $I_{t_j}$).
• if the same terms matched the same document, then the matrix is unchanged.
• else $M[I_{t_i},I_{t_j}]+=1$