Topic Categorisation


As part of a joint research venture between Birmingham and DigitalEd, the question of how to automatically tag content within Mobius was investigated.

Tagging of content within Mobius is currently under-utilised. Where content is tagged at all, it is often tagged inconsistently or poorly. A larger, Europe-wide research question was posed to try to establish a consistent set of tags. Tagging makes content searchable for other users, and the lack of searchability is probably the biggest barrier to re-using content, whether within an institution or, ideally, across institutions.

The goal of this research was to use the well-tagged questions as training data, and then to investigate whether new content could be tagged automatically from the text of the question itself, essentially having the machine read the question and interpret it. We investigated a text analysis technique, probabilistic latent semantic analysis, which provides a way of building a “model” linking the content of questions to the tags that have been applied to them. When the model sees a new question, it can then suggest the most likely tags to the user.
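To make the idea of suggesting tags from training data concrete, here is a minimal sketch. It is not the probabilistic latent semantic analysis model used in the research; it swaps in a much simpler stand-in, a naive Bayes style word-count model with add-one smoothing. The function names, the tags and the four training questions are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train(tagged_questions):
    """Count words per tag from (question_text, tag) pairs."""
    word_counts = defaultdict(Counter)  # tag -> word -> count
    tag_counts = Counter()              # tag -> number of questions
    vocabulary = set()
    for text, tag in tagged_questions:
        words = tokenize(text)
        word_counts[tag].update(words)
        tag_counts[tag] += 1
        vocabulary.update(words)
    return word_counts, tag_counts, vocabulary


def suggest_tags(text, model, top_n=2):
    """Rank tags by log P(tag) + sum of log P(word | tag), with add-one smoothing."""
    word_counts, tag_counts, vocabulary = model
    total_questions = sum(tag_counts.values())
    scores = {}
    for tag in tag_counts:
        total_words = sum(word_counts[tag].values())
        score = math.log(tag_counts[tag] / total_questions)
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[tag][word] + 1) / (total_words + len(vocabulary))
            )
        scores[tag] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical training data, invented for illustration only.
training = [
    ("differentiate the function x squared", "calculus"),
    ("integrate the function sin x", "calculus"),
    ("find the determinant of the matrix", "linear-algebra"),
    ("compute the eigenvalues of the matrix", "linear-algebra"),
]
model = train(training)
suggestions = suggest_tags("find the eigenvalues of this matrix", model)
print(suggestions)
```

With this toy data the new question shares most of its words with the "linear-algebra" examples, so that tag is ranked first. Real question banks are far noisier, which is where the difficulties described in the next paragraph arise.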

The research showed, however, that standard text mining algorithms are fairly effective at finding similar texts (unsupervised learning) but are deficient at predicting topics. They suffer from rather subtle yet elementary issues, which mean that prediction is more ad hoc than one might expect. Nonetheless, heuristic methods have been created which tag content relatively well and, surprisingly, we found that a well-developed model could compete with some very advanced off-the-shelf AI solutions.
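As a rough illustration of the "finding similar texts" side, the sketch below compares questions by the cosine similarity of their word counts. This is a generic technique chosen for illustration and is not claimed to be the method used in the research; the function names and example questions are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as a mapping from word to occurrence count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, corpus):
    """Return the text in corpus whose word counts are closest to the query's."""
    query_vec = bag_of_words(query)
    return max(corpus, key=lambda doc: cosine_similarity(query_vec, bag_of_words(doc)))


# Hypothetical questions, invented for illustration only.
questions = [
    "differentiate the function x squared",
    "find the determinant of the matrix",
]
nearest = most_similar("compute the determinant of a matrix", questions)
print(nearest)
```

Note that this only says which existing question looks most alike; it says nothing directly about which tag a new question should receive, which is the harder prediction problem.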

This research is leading to a paper to be published, which clarifies some confusion in the literature regarding supervised probabilistic latent semantic analysis.

