I have a set of documents and I want to know the topic distribution for each document (for different values of the number of topics). I took a toy program from this question. I first train an LDA model with gensim and then query it with the training documents themselves to get the topic distribution of each document in the training data. But I always get a uniform topic distribution.
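To be concrete about the end goal, this is roughly the loop I want to run once the model is trained (just a sketch, using the mm corpus and lda model that are built in the code below):

# Sketch: print the (topic_id, probability) pairs for every training document
for doc_id, bow in enumerate(mm):
    print doc_id, lda[bow]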
Here is the toy code I used
import gensim
import logging
logging.basicConfig(filename="logfile",format='%(message)s', level=logging.INFO)
def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return topic_dist
documents = ['Human machine interface for lab abc computer applications',
'A survey of user opinion of computer system response time',
'The EPS user interface management system',
'System and human system engineering testing of EPS',
'Relation of user perceived response time to error measurement',
'The generation of random binary unordered trees',
'The intersection graph of paths in trees',
'Graph minors IV Widths of trees and well quasi ordering',
'Graph minors A survey']
texts = [[word for word in document.lower().split()] for document in documents]
dictionary = gensim.corpora.Dictionary(texts)
id2word = {}
for word in dictionary.token2id:
    id2word[dictionary.token2id[word]] = word
mm = [dictionary.doc2bow(text) for text in texts]
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=2, update_every=1, chunksize=10000, passes=1, minimum_probability=0.0)
newdocs=["human system"]
print lda[dictionary.doc2bow(newdocs)]
newdocs=["Human machine interface for lab abc computer applications"] #same as 1st doc in training
print lda[dictionary.doc2bow(newdocs)]
Here is the output:
[(0, 0.5), (1, 0.5)]
[(0, 0.5), (1, 0.5)]
I have checked with a few more examples, but they all gave the same equiprobable result.
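If it helps with diagnosis, the bag-of-words vector that actually gets handed to the model can be inspected like this (a small sketch; it just prints the raw (token_id, count) pairs and the corresponding words for the last newdocs above):

# Diagnostic sketch: show what doc2bow produces for a query before it reaches the model
bow = dictionary.doc2bow(newdocs)
print bow
print [(id2word[token_id], count) for token_id, count in bow]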
Here is the generated logfile (i.e. the logger output):
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(42 unique tokens: [u'and', u'minors', u'generation', u'testing', u'iv']...) from 9 documents (total 69 corpus positions)
using symmetric alpha at 0.5
using symmetric eta at 0.5
using serial LDA version on this node
running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
-5.796 per-word bound, 55.6 perplexity estimate based on a held-out corpus of 9 documents with 69 words
PROGRESS: pass 0, at document #9/9
topic #0 (0.500): 0.057*"of" + 0.043*"user" + 0.041*"the" + 0.040*"trees" + 0.039*"interface" + 0.036*"graph" + 0.030*"system" + 0.027*"time" + 0.027*"response" + 0.026*"eps"
topic #1 (0.500): 0.088*"of" + 0.061*"system" + 0.043*"survey" + 0.040*"a" + 0.036*"graph" + 0.032*"trees" + 0.032*"and" + 0.032*"minors" + 0.031*"the" + 0.029*"computer"
topic diff=0.539396, rho=1.000000
It says 'too few updates, training might not converge', so I tried increasing the number of passes to 1000, but the output is still the same. (Though this is not related to convergence, I have also tried increasing the number of topics.)
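For reference, the only thing I changed in that experiment was the passes argument; as far as I recall, the rest of the call was identical:

# Same call as above, but with passes=1000
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=2,
                                      update_every=1, chunksize=10000, passes=1000,
                                      minimum_probability=0.0)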