Latent Dirichlet Allocation (LDA) is a topic model for discovering the latent variables (topics) underlying a collection of documents. I'm using the Python gensim package and am running into two problems:

  1. I printed out the most frequent words for each topic (I tried 10, 20, and 50 topics) and found that the distribution over words is very "flat": even the most frequent word has only about a 1% probability... (see the sketch after this list for roughly how I'm doing this)

  2. Most of the topics are similar: the most frequent words of each topic overlap heavily, and the topics share almost the same set of high-frequency words...
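
For reference, here is roughly how I'm training the model and printing the topics (a minimal sketch with toy data, not my real corpus; all names are placeholders):

```python
from gensim import corpora, models

# Toy tokenized documents standing in for my game descriptions.
texts = [
    ["strategy", "game", "combat", "resource", "management"],
    ["fantasy", "mmorpg", "guilds", "raids", "crafting"],
    ["puzzle", "game", "daily", "challenges", "leaderboards"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

# Print the highest-probability words per topic; the "flat" distribution
# shows up here as uniformly tiny word probabilities.
for topic in lda.print_topics(num_topics=2, num_words=5):
    print(topic)
```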

I suspect the problem is my documents: they all belong to one specific category; for example, they are all documents introducing different online games. Will LDA still work in my case? Since the documents themselves are quite similar, a model based on "bag of words" may not be a good fit.

Could anyone give me some suggestions? Thank you!

1 Answer

I've found NMF to perform better when a corpus is smaller and more focused on a particular topic. In a corpus of ~250 documents all discussing the same issue, NMF was able to pull out 7 distinct, coherent topics. This has also been reported by other researchers:

"Another advantage that is particularly useful for the appli- cation presented in this paper is that NMF is capable of identifying niche topics that tend to be under-reported in traditional LDA approaches" (p.6)

Greene & Cross, Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach, PDF

Unfortunately, Gensim doesn't have an implementation of NMF, but it is available in Scikit-Learn. For NMF to work effectively, you need to feed it TF-IDF-weighted word vectors rather than the raw frequency counts you use with LDA.
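
A minimal sketch of that pipeline in Scikit-Learn (the toy corpus and parameter values below are placeholders; tune them for a real corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Placeholder documents; substitute your own corpus and raise
# n_components accordingly.
docs = [
    "strategy game with turn based combat and resource management",
    "fantasy mmorpg with guilds raids and crafting",
    "puzzle game with daily challenges and leaderboards",
    "shooter with ranked multiplayer matches and seasonal events",
]

# NMF expects TF-IDF weights rather than the raw counts used for LDA.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, random_state=1)
nmf.fit(tfidf)

# Show the top words for each topic.
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    top = topic.argsort()[:-6:-1]
    print("Topic %d:" % topic_idx, " ".join(feature_names[i] for i in top))
```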

If you're used to Gensim and have preprocessed everything that way, gensim has some utilities to convert a corpus to Scikit-compatible structures. However, I think it would actually be simpler to use Scikit-Learn end to end. There is a good example of using NMF here.
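
If you do go the conversion route, here is a minimal sketch of it (assuming the usual Dictionary/doc2bow preprocessing; the toy data and names are illustrative):

```python
from gensim import corpora
from gensim.matutils import corpus2csc
from sklearn.feature_extraction.text import TfidfTransformer

texts = [["game", "combat", "strategy"], ["game", "puzzle", "daily"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# corpus2csc yields a terms-by-documents sparse matrix; transpose it
# into the documents-by-terms layout Scikit-Learn expects.
counts = corpus2csc(corpus, num_terms=len(dictionary)).T
tfidf = TfidfTransformer().fit_transform(counts)  # ready for NMF.fit(tfidf)
```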