Moose-Algos-InformationRetrieval makes up the basics of Hapax, a suite of prototypical tools developed by Adrian Kuhn to perform vocabulary analysis on documents.
Moose-Algos-InformationRetrieval provides the following:
A MalTerms instance contains the list of words retrieved from any document with source text and parsed by a scanner. It’s actually a subclass of Bag so you can easily get the number of occurrences for each word.
For example, execute:
MalTerms fromString:
'A survey of user opinion of computer system response time'
--> a MalTerms('user' 'of' 'of' 'survey' 'system' 'a'
'opinion' 'computer' 'response' 'time')
self sortedCounts
--> {(2->'of'). (1->'user'). (1->'survey'). (1->'system').
(1->'a'). (1->'opinion'). (1->'computer'). (1->'response').
(1->'time')}
The MalTerms class is really the workhorse of Moose-Algos-IR: you can perform stemming, pruning, selection operations on a MalTerms instance to further refine the list of words. Let’s use a more complete sample:
terms := MalTerms sample.
terms sortedCounts.
--> {(6->'a'). (5->'tree'). (3->'is'). (2->'i'). (2->'that').
(2->'root'). (2->'the'). (2->'are'). (2->'by'). ...}
It looks interesting but there is also lots of noise due to common words like ’a’ ’is’ ’the’... We will start by removing such tokens which are English stop words (i.e. words too common to be of meaningful value).
terms removeStopwords
--> {(5->'tree'). (2->'root'). (2->'paths'). (1->'elected').
(1->'lan'). (1->'folks'). (1->'cost'). ...}
Now let’s use the Porter stemmer algorithm to group together words with the same root.
terms stemAll
--> {(5->'tree'). (2->'span'). (2->'root'). (2->'path').
(1->'lan'). (1->'folk'). (1->'elect'). ...}
It didn’t do much but notice how we now have the term ’span’ with 2 occurrences (before that we had ’span’ and ’spanning’) and also how the stemmer reduces other terms: ’elected’ -> ’elect’, ’paths’ -> ’path’, etc.
Finally, to drastically reduce the number of terms, we can remove ’hapaxes’, that is terms with only 1 occurrence.
terms removeHapaxes
--> {(5->'tree'). (2->'root'). (2->'path'). (2->'span')}
The three operations above can be actually performed in one step: terms prune removes stop words, stem tokens, and remove hapaxes.
A MalTerms instance has interesting capabilities, but it becomes much more interesting when you have multiple documents, each with its terms, and you want to compare them.
A MalCorpus instance just holds all MalTerms instances created from a collection of documents, like a dictionary: each document should have a name as key and a MalTerms instance as value. You can add a new document to your corpus like that:
corpus := MalCorpus new addDocument: #sample with:
(MalTerms fromString:
'A survey of user opinion of computer system response time');
yourself.
corpus atDocument: #sample
--> a MalTerms('user' 'of' 'of' 'survey' 'system' 'a'
'opinion' 'computer' 'response' 'time')
(See the API of MalCorpus and MalTerms for other possibilities, in particular directly importing files and directories).
There is a ready-made sample of documents in the class MalCorpus. Just browse MalCorpus deer90 to continue. A primary ability is building the corpus of terms retrieved for all documents:
corpus := MalCorpus deer90
corpus terms sortedCounts
--> {(7->'of'). (4->'system'). (3->'the'). (3->'trees').
(3->'graph'). (3->'user'). (2->'response'). (2->'eps').
(2->'and'). (2->'minors'). (2->'time'). ...}
Since it’s an instance of MalTerms, we can prune it to get the most relevant terms:
corpus terms prune sortedCounts
--> {(4->'system'). (3->'graph'). (3->'tree'). (3->'user').
(2->'comput'). (2->'human'). (2->'interfac'). (2->'respon').
(2->'time'). (2->'minor'). (2->'ep'). (2->'survei')}
A final capability of the basic Moose-Algos-IR package is its log-likelihood ratio calculator. It offers a simple mean to compare two corpus of terms, identify which terms are the most characteristic in each corpus (compared to the other) and which terms are the most common among both corpus.
You can initialize a log-likelihood instance with
likelihood := MalLogLikelihoodRatio
with: (corpus atDocument: #c4) and: corpus terms
Then you can compute the likelihood ratio for each term of the two corpus with:
likelihood computeAll
--> {('system'->2.6010188219330885).
('engineering'->2.068256902206702).
('testing'->2.068256902206702). ('eps'->1.2191069453459846).
('human'->1.2191069453459846). ('and'->1.2191069453459846).
('of'->0.04065828200909749). ('opinion'->-0.011764206396069454).
... ('the'->-0.03628422592434788)}
The terms are sorted in descending order of likelihood with respect to the first corpus (here, the #c4 document):
One can display the likelihood only for terms of the first document using:
likelihood computeAllOfTerms1
--> {('system'->2.6010188219330885).
('engineering'->2.068256902206702).
('testing'->2.068256902206702). ('eps'->1.2191069453459846).
('human'->1.2191069453459846). ('and'->1.2191069453459846).
('of'->0.04065828200909749)}
The above example performs a normative comparison to extract the most specific terms of a document with respect to the full corpus of documents. However, the two corpus of terms given to MalLogLikelihoodRatio can come from any source and be of any granularity. For example, one can compare two documents with each other, seeing how similar or different they are.
corpus := MalCorpus deer90.
likelihood := MalLogLikelihoodRatio
with: (corpus atDocument: #m2) removeStopwords
and: (corpus atDocument: #m3) removeStopwords.
likelihood computeAll
--> {('paths'->0.8689704674722245).
('intersection'->0.8689704674722245).
('trees'->0.190755274284232).
('graph'->0.190755274284232).
('widths'->-0.19788424721091502).
('minors'->-0.19788424721091502).
('ordering'->-0.19788424721091502).
('iv'->-0.19788424721091502).
('quasi'->-0.19788424721091502)}
Both documents talk about trees and graphs, yet the first deals more with paths and intersection while the second is about widths and ordering.
To get more details about the log-likelihood ratio in Hapax, you can read this brief article from Adrian. He also presents a third interesting application of the log likelihood to compare two subsequent versions of the same document. This offers the possibility to characterize the evolution of the document.
Be careful that using the above tools may actually be the easiest part. Depending on your domain and your problem, just isolating the meaningful parts of your source document can represent a significant problem. Consider for example retrieving terms from a program source: you may have to use language keywords as stopwords, or just extract comments and identifiers to get meaningful terms.
The original Hapax contains many more sophisticated algorithms (including Latent Semantic Indexing). Unfortunately, they are not as well tested and these algorithms are not part of Moose-Algos for now: http://www.squeaksource.com/Hapax.html