2
votes

I am running LDA on a small corpus of 2 docs (sentences) for testing purposes. The following code returns topic-term and document-topic distributions that are not reasonable at all given the input documents. Running exactly the same example in Python returns reasonable results. Does anyone know what is wrong here?

library(topicmodels)
library(tm)

d1 <- "bank bank bank"
d2 <- "stock stock stock"

corpus <- Corpus(VectorSource(c(d1,d2)))

## build a document-term matrix and fit LDA via Gibbs sampling
dtm <- DocumentTermMatrix(corpus)
ldafit <- LDA(dtm, k = 2, method = "Gibbs")

## extract the posterior topic-term and document-topic distributions
topicTerm <- t(posterior(ldafit)$terms)
docTopic <- posterior(ldafit)$topics
topicTerm
docTopic

> topicTerm
              1         2
bank  0.3114525 0.6885475
stock 0.6885475 0.3114525
> docTopic
          1         2
1 0.4963245 0.5036755
2 0.5036755 0.4963245

The results from Python are as follows:

>>> docTopic
array([[ 0.87100799,  0.12899201],
       [ 0.12916713,  0.87083287]])
>>> fit.print_topic(1)
u'0.821*"bank" + 0.179*"stock"'
>>> fit.print_topic(0)
u'0.824*"stock" + 0.176*"bank"'
Interesting question. I have used this package in a real setting and it produced good results. I'm pretty sure these stinky results have something to do with the tiny corpus you're using. Could you post the results from Python for comparison? - KenHBS
Sure, please see the updated post. - schimo
gensim and topicmodels use different methods. gensim uses variational inference, whereas topicmodels uses collapsed Gibbs sampling here. topicmodels also has a variational inference option, but it gives the same crappy results. I'm at a dead end too, sorry - KenHBS
But the results in R with topicmodels barely change when using 'VEM' as the estimation method... - schimo
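For reference, the VEM variant discussed in the comments can be run as below. This is a minimal sketch reusing the dtm from the question; per the comments, on this toy corpus it gives similarly flat posteriors:

## variational EM instead of Gibbs sampling, on the same data
ldafit_vem <- LDA(dtm, k = 2, method = "VEM")
posterior(ldafit_vem)$topics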

2 Answers

2
votes

The author of the R package topicmodels, Bettina Grün, pointed out that this is due to the selection of the hyperparameter 'alpha'.

LDA in R defaults to alpha = 50/k = 25, while LDA in gensim (Python) defaults to alpha = 1/k = 0.5. A smaller alpha value favors sparse solutions of the document-topic distributions, i.e. each document contains a mixture of just a few topics. Hence, decreasing alpha in LDA in R yields very reasonable results:

ldafit <- LDA(dtm, k=2, method="Gibbs", control=list(alpha=0.5)) 

posterior(ldafit)$topics
#    1     2
# 1  0.125 0.875
# 2  0.875 0.125

posterior(ldafit)$terms
#   bank    stock
# 1 0.03125 0.96875
# 2 0.96875 0.03125
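
As a sanity check, the default alpha can be read off the fitted model, and since Gibbs sampling is stochastic, fixing a seed makes runs with different alphas comparable. A minimal sketch, assuming the `alpha` slot of the fitted object and the `seed` control option of topicmodels:

## the default fit exposes the alpha that was used (50/k = 25 here)
ldafit_default <- LDA(dtm, k = 2, method = "Gibbs")
ldafit_default@alpha

## fix the seed so runs with different alphas are reproducible
ldafit <- LDA(dtm, k = 2, method = "Gibbs",
              control = list(alpha = 0.5, seed = 42))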
0
votes

Try plotting the perplexity over iterations and make sure it converges. The initial state also matters. (Both the document size and the sample size seem small here, though.)
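
topicmodels records the log-likelihood trace rather than the perplexity directly; a minimal sketch of monitoring convergence that way, assuming the Gibbs control option `keep` (store the log-likelihood every `keep` iterations) and the fitted model's `logLiks` slot:

## record the log-likelihood at every iteration
ldafit <- LDA(dtm, k = 2, method = "Gibbs",
              control = list(alpha = 0.5, keep = 1, iter = 2000))

## plot the stored trace; a flat tail suggests the chain has converged
plot(ldafit@logLiks, type = "l",
     xlab = "Iteration", ylab = "Log-likelihood")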