Vocabulary

Moose-Algos-InformationRetrieval makes up the basics of Hapax, a suite of prototypical tools developed by Adrian Kuhn to perform vocabulary analysis on documents.

Moose-Algos-InformationRetrieval provides the following:

  • data classes to handle corpus of documents and terms
  • scanners to parse source text into terms/tokens (words and camelcase words)
  • stemmers to compute the generic root of each word (for English and German) i.e., ’computer’ and ’computing’ share the same root ’comput’ - this is often used to give more weight to a concept with multiple shapes
  • stop words for the English language (list of common and frequent terms in a language which are irrelevant to the domain)
  • a log-likelihood ratio calculator, which can compare any two corpus of terms and retrieve most characteristic terms for each corpus and most common terms among the two corpus.

Building a corpus of terms

A MalTerms instance contains the list of words retrieved from any document with source text and parsed by a scanner. It’s actually a subclass of Bag so you can easily get the number of occurrences for each word.

For example, execute:

MalTerms fromString:
'A survey of user opinion of computer system response time'
--> a MalTerms('user' 'of' 'of' 'survey' 'system' 'a'
'opinion' 'computer' 'response' 'time')
self sortedCounts
--> {(2->'of'). (1->'user'). (1->'survey'). (1->'system').
(1->'a'). (1->'opinion'). (1->'computer'). (1->'response').
(1->'time')}

The MalTerms class is really the workhorse of Moose-Algos-IR: you can perform stemming, pruning, selection operations on a MalTerms instance to further refine the list of words. Let’s use a more complete sample:

terms := MalTerms sample.
terms sortedCounts.
--> {(6->'a'). (5->'tree'). (3->'is'). (2->'i'). (2->'that').
(2->'root'). (2->'the'). (2->'are'). (2->'by'). ...}

It looks interesting but there is also lots of noise due to common words like ’a’ ’is’ ’the’... We will start by removing such tokens which are English stop words (i.e. words too common to be of meaningful value).

terms removeStopwords
--> {(5->'tree'). (2->'root'). (2->'paths'). (1->'elected').
(1->'lan'). (1->'folks'). (1->'cost'). ...}

Now let’s use the Porter stemmer algorithm to group together words with the same root.

terms stemAll
--> {(5->'tree'). (2->'span'). (2->'root'). (2->'path').
(1->'lan'). (1->'folk'). (1->'elect'). ...}

It didn’t do much but notice how we now have the term ’span’ with 2 occurrences (before that we had ’span’ and ’spanning’) and also how the stemmer reduces other terms: ’elected’ -> ’elect’, ’paths’ -> ’path’, etc.

Finally, to drastically reduce the number of terms, we can remove ’hapaxes’, that is terms with only 1 occurrence.

terms removeHapaxes
--> {(5->'tree'). (2->'root'). (2->'path'). (2->'span')}

The three operations above can be actually performed in one step: terms prune removes stop words, stem tokens, and remove hapaxes.

A MalTerms instance has interesting capabilities, but it becomes much more interesting when you have multiple documents, each with its terms, and you want to compare them.

Building a corpus of documents

A MalCorpus instance just holds all MalTerms instances created from a collection of documents, like a dictionary: each document should have a name as key and a MalTerms instance as value. You can add a new document to your corpus like that:

corpus := MalCorpus new addDocument: #sample with:
(MalTerms fromString:
'A survey of user opinion of computer system response time');
yourself.

corpus atDocument: #sample
--> a MalTerms('user' 'of' 'of' 'survey' 'system' 'a'
'opinion' 'computer' 'response' 'time')

(See the API of MalCorpus and MalTerms for other possibilities, in particular directly importing files and directories).

There is a ready-made sample of documents in the class MalCorpus. Just browse MalCorpus deer90 to continue. A primary ability is building the corpus of terms retrieved for all documents:

corpus := MalCorpus deer90
corpus terms sortedCounts
--> {(7->'of'). (4->'system'). (3->'the'). (3->'trees').
(3->'graph'). (3->'user'). (2->'response'). (2->'eps').
(2->'and'). (2->'minors'). (2->'time'). ...}

Since it’s an instance of MalTerms, we can prune it to get the most relevant terms:

corpus terms prune sortedCounts
--> {(4->'system'). (3->'graph'). (3->'tree'). (3->'user').
(2->'comput'). (2->'human'). (2->'interfac'). (2->'respon').
(2->'time'). (2->'minor'). (2->'ep'). (2->'survei')}

Comparing two corpus of terms

A final capability of the basic Moose-Algos-IR package is its log-likelihood ratio calculator. It offers a simple mean to compare two corpus of terms, identify which terms are the most characteristic in each corpus (compared to the other) and which terms are the most common among both corpus.

You can initialize a log-likelihood instance with

likelihood := MalLogLikelihoodRatio
with: (corpus atDocument: #c4) and: corpus terms

Then you can compute the likelihood ratio for each term of the two corpus with:

likelihood computeAll
--> {('system'->2.6010188219330885).
('engineering'->2.068256902206702).
('testing'->2.068256902206702). ('eps'->1.2191069453459846).
('human'->1.2191069453459846). ('and'->1.2191069453459846).
('of'->0.04065828200909749). ('opinion'->-0.011764206396069454).
... ('the'->-0.03628422592434788)}

The terms are sorted in descending order of likelihood with respect to the first corpus (here, the #c4 document):

  • the higher a positive number is, the more specific the term is to the document with respect to the second corpus (here, with respect to all documents). We can see in a straightforward manner that ’system’ is the most specific, followed by ’engineering’, and ’testing’.
  • Negative numbers indicate terms present in the second corpus but not in the first document. The lower the number, the more the term is present in other documents.
  • Positive numbers nearing zero indicate terms which are common in both corpus, like ’of’.

One can display the likelihood only for terms of the first document using:

likelihood computeAllOfTerms1 
--> {('system'->2.6010188219330885).
('engineering'->2.068256902206702).
('testing'->2.068256902206702). ('eps'->1.2191069453459846).
('human'->1.2191069453459846). ('and'->1.2191069453459846).
('of'->0.04065828200909749)}

The above example performs a normative comparison to extract the most specific terms of a document with respect to the full corpus of documents. However, the two corpus of terms given to MalLogLikelihoodRatio can come from any source and be of any granularity. For example, one can compare two documents with each other, seeing how similar or different they are.

corpus := MalCorpus deer90.
likelihood := MalLogLikelihoodRatio
with: (corpus atDocument: #m2) removeStopwords
and: (corpus atDocument: #m3) removeStopwords.
likelihood computeAll
--> {('paths'->0.8689704674722245).
('intersection'->0.8689704674722245).
('trees'->0.190755274284232).
('graph'->0.190755274284232).
('widths'->-0.19788424721091502).
('minors'->-0.19788424721091502).
('ordering'->-0.19788424721091502).
('iv'->-0.19788424721091502).
('quasi'->-0.19788424721091502)}

Both documents talk about trees and graphs, yet the first deals more with paths and intersection while the second is about widths and ordering.

To get more details about the log-likelihood ratio in Hapax, you can read this brief article from Adrian. He also presents a third interesting application of the log likelihood to compare two subsequent versions of the same document. This offers the possibility to characterize the evolution of the document.

A word of caution

Be careful that using the above tools may actually be the easiest part. Depending on your domain and your problem, just isolating the meaningful parts of your source document can represent a significant problem. Consider for example retrieving terms from a program source: you may have to use language keywords as stopwords, or just extract comments and identifiers to get meaningful terms.

The original Hapax contains many more sophisticated algorithms (including Latent Semantic Indexing). Unfortunately, they are not as well tested and these algorithms are not part of Moose-Algos for now: http://www.squeaksource.com/Hapax.html

Selected publications

    1. Adrian Kuhn. Automatic Labeling of Software Components and their Evolution using Log-Likelihood Ratio of Word Frequencies in Source Code. In MSR '09: Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, p. 175—178, IEEE, 2009. DOI PDF 
    1. Adrian Kuhn, Stéphane Ducasse, and Tudor Gîrba. Semantic Clustering: Identifying Topics in Source Code. In Information and Software Technology 49(3) p. 230—243, March 2007. DOI PDF