3 votes

I have a set of documents and I want to know the topic distribution for each document (for different values of the number of topics). I have taken a toy program from this question. I first train an LDA model provided by gensim, and then I query it with the training data itself to get the topic distribution of each document in the training set. But I always get a uniform topic distribution.

Here is the toy code I used:

import gensim
import logging
logging.basicConfig(filename="logfile",format='%(message)s', level=logging.INFO)


def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize the distribution
    return topic_dist

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [document.lower().split() for document in documents]
dictionary = gensim.corpora.Dictionary(texts)
id2word = {token_id: token for token, token_id in dictionary.token2id.items()}
mm = [dictionary.doc2bow(text) for text in texts]
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=2,
                                      update_every=1, chunksize=10000, passes=1,
                                      minimum_probability=0.0)

newdocs=["human system"]
print lda[dictionary.doc2bow(newdocs)]

newdocs=["Human machine interface for lab abc computer applications"] #same as 1st doc in training
print lda[dictionary.doc2bow(newdocs)]

Here is the output:

[(0, 0.5), (1, 0.5)]
[(0, 0.5), (1, 0.5)]

I have checked some more examples, but all of them ended up giving the same equiprobable result.

Here is the generated logfile (i.e. the logger output):

adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(42 unique tokens: [u'and', u'minors', u'generation', u'testing', u'iv']...) from 9 documents (total 69 corpus positions)
using symmetric alpha at 0.5
using symmetric eta at 0.5
using serial LDA version on this node
running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50x with a convergence threshold of 0.001000
too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
-5.796 per-word bound, 55.6 perplexity estimate based on a held-out corpus of 9 documents with 69 words
PROGRESS: pass 0, at document #9/9
topic #0 (0.500): 0.057*"of" + 0.043*"user" + 0.041*"the" + 0.040*"trees" + 0.039*"interface" + 0.036*"graph" + 0.030*"system" + 0.027*"time" + 0.027*"response" + 0.026*"eps"
topic #1 (0.500): 0.088*"of" + 0.061*"system" + 0.043*"survey" + 0.040*"a" + 0.036*"graph" + 0.032*"trees" + 0.032*"and" + 0.032*"minors" + 0.031*"the" + 0.029*"computer"
topic diff=0.539396, rho=1.000000

It says 'too few updates, training might not converge', so I tried increasing the number of passes to 1000, but the output is still the same. (Though it is not related to convergence, I have also tried increasing the number of topics.)


1 Answer

2 votes

The problem is in how the variable newdocs is transformed into a gensim document. dictionary.doc2bow() does indeed expect a list, but a list of words. You provide a list of documents, so it interprets "human system" as a single word; there is no such word in the training set, so it is ignored. To make my point clearer, see the output of the following code:

import gensim
documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [document.lower().split() for document in documents]
dictionary = gensim.corpora.Dictionary(texts)

print dictionary.doc2bow("human system".split())
print dictionary.doc2bow(["human system"])
print dictionary.doc2bow(["human"])
print dictionary.doc2bow(["foo"])

So to correct the code above, all you have to do is change newdocs as follows:

newdocs = "human system".lower().split()
newdocs = "Human machine interface for lab abc computer applications".lower().split()

By the way, the behaviour you observe, getting the same probabilities, is simply the topic distribution of the empty document, that is, a uniform distribution.
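You can verify this directly: for unknown tokens doc2bow returns an empty bag-of-words, and inference on it just gives back the symmetric prior. A minimal sketch, assuming the lda model trained above (with minimum_probability=0.0):

print dictionary.doc2bow(["foo", "bar"])       # [] -- no known tokens
print lda[dictionary.doc2bow(["foo", "bar"])]  # [(0, 0.5), (1, 0.5)] -- uniform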