Hapax

Hapax is an Information Retrieval tool to analyze the vocabulary of software systems, i.e., how classes and methods are related by topic rather than by structure. It was written by Adrian Kuhn in 2005 as validation of his Master’s thesis.

Overview

“Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing formal information only, the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming.”

Semantic Clustering identifies topics in source code. Based on Latent Semantic Indexing and clustering, it groups source artifacts that use similar vocabulary. We call these groups semantic clusters and interpret them as linguistic topics that reveal the intention of the code. We compare the concepts to each other, identify links between them, provide automatically retrieved labels, and use a visualization to illustrate how they are distributed over the system’s structure. Our approach is language independent as it works at the level of identifier names and comments.
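
The vocabulary side of this idea can be illustrated with a small Python sketch. It is not the Hapax implementation: it deliberately skips the Latent Semantic Indexing step (a truncated SVD of the term-document matrix) and compares raw term counts instead. The artifact names, the identifiers, the helper functions and the 0.3 similarity threshold are all assumptions made up for the example.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def term_vector(identifiers):
    """Count the terms found in all identifiers of one source artifact."""
    counts = Counter()
    for name in identifiers:
        counts.update(split_identifier(name))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.3):
    """Greedy grouping: an artifact joins the first cluster that holds a
    member at least `threshold` similar to it, else it starts a new one."""
    clusters = []
    for name, vec in vectors.items():
        for group in clusters:
            if any(cosine(vec, vectors[other]) >= threshold for other in group):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical artifacts, each described only by its identifier names.
artifacts = {
    "JobSearch": ["searchJobs", "jobQuery", "rankSearchResults"],
    "SearchIndex": ["indexJob", "searchIndex", "queryTerms"],
    "PdfExport": ["writePdfPage", "pdfFont", "renderPage"],
    "PdfLayout": ["layoutPage", "pdfMargin", "pageSize"],
}
vectors = {name: term_vector(ids) for name, ids in artifacts.items()}
clusters = cluster(vectors)
print(clusters)
```

The two search-related artifacts share terms such as "search", "job" and "query" and end up in one group, while the two PDF-related artifacts form another, although nothing in the sketch looks at how the artifacts call or inherit from each other.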

Hapax is a software analysis tool built on top of Moose. Adrian Kuhn developed Hapax as part of his Master’s thesis, and it implements Semantic Clustering. The name of the tool is derived from the term hapax legomenon, which refers to a word occurring only once in a given body of text.
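
To make the term concrete, the following Python sketch lists the hapax legomena of a short text. It only illustrates the definition and is not part of the tool; the function name and the sample sentence are assumptions chosen for the example.

```python
import re
from collections import Counter


def hapax_legomena(text):
    """Return, in alphabetical order, the words that occur exactly once."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n == 1)


# Hypothetical sample text.
sample = "the parser reads the file and the lexer reads tokens"
print(hapax_legomena(sample))
```

Here "the" and "reads" occur more than once, so they are not reported.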

We are frequently asked about the difference between topics and concepts. The following excerpt from the journal paper may help to clarify it.

On the congruence between topics and concepts

When starting this work, one of our hypotheses was that Semantic Clustering would reveal a system’s domain semantics. But our experiments disproved this hypothesis: most linguistic topics are application concepts or architectural components, such as layers. In most case studies, our approach partitioned the system into one (or sometimes two) large domain-specific parts and up to a dozen domain-independent parts, such as input/output or data storage facilities. Consider, for example, the application below, Outsight, a web-based job portal application. It is divided into nine parts as follows:

[Figure: Outsight distribution map]

Only one of the nine topics is about the system’s domain: job exchange. Topic Red includes the complete domain of the system, that is, users, companies and CVs. All the other topics are application-specific components: topic Blue is a search engine, topic DarkGreen implements PDF generation, topic Green is text and file handling, topics Cyan and Magenta provide access to the database, and topic DarkCyan is a testing and debugging facility. Additionally, the cross-cutting topic Yellow bundles high-level clones related to time and timestamps.

Publications

Download

The current research prototype of Hapax is available at the following Store coordinates:

  Bundle: HapaxDevelopment   
  interface: PostgreSQLEXDIConnection
  environment: db.iam.unibe.ch_scgStore
  user name: storeguest
  password: storeguest
  table owner: BERN

Please drop a mail to for questions and feedback.

License: BSD