2 votes

As I understand it, if I'm training an LDA model over a corpus where the size of the dictionary is, say, 1000 and the number of topics (K) is 10, then for each word in the dictionary I should have a vector of size 10, where each position in the vector is the probability that the word belongs to that particular topic, right?

So my question is: given a word, what is the probability that the word belongs to topic k, where k can be from 1 to 10? How do I get this value from the gensim LDA model?

I was using the get_term_topics method, but it doesn't output the probabilities for all the topics. For example:

lda_model1.get_term_topics("fun")
[(12, 0.047421702085626238)]

but I want to see the probability that "fun" belongs to each of the other topics as well.
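For reference, here is a minimal sketch of the kind of setup I mean (the toy corpus and the variable names are just placeholders, not my real data):

from gensim import corpora
from gensim.models import LdaModel

# toy stand-in for the real corpus
texts = [["fun", "game", "play"],
         ["topic", "model", "word"],
         ["fun", "word", "game"]]

dictionary = corpora.Dictionary(texts)                 # ~1000 words in the real case
corpus = [dictionary.doc2bow(text) for text in texts]

lda_model1 = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

# only returns topics above a probability threshold, not all 10
print(lda_model1.get_term_topics("fun"))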


1 Answer

3 votes

For anyone who is looking for the answer, I found it.

These probability values are in the xx.expElogbeta numpy array. The number of rows in this matrix equals the number of topics, and the number of columns is the size of your dictionary (the words). So if you take the values of a particular column, you get the probability of that word belonging to each of the topics.

e.g.,

>>> import numpy as np
>>> from gensim import corpora
>>> data = np.load("model.expElogbeta.npy")   # saved alongside the model by lda.save("model")
>>> data.shape
(20, 6481)  # I trained with 20 topics == number of rows
>>> dict = corpora.Dictionary.load(dictf)     # dictf is the path to the saved dictionary
>>> len(dict.keys())
6481  # the columns of the npy array are the words in my dictionary
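Continuing the session above, something like this should give you the column for a single word such as "fun" (this assumes the word is actually in the dictionary):

>>> word_id = dict.token2id["fun"]    # column index of the word
>>> topic_vals = data[:, word_id]     # one value per topic, shape (20,)
>>> list(enumerate(topic_vals))       # (topic_id, value) pairs for every topic

Note that expElogbeta holds exp(E[log beta]), so each row is only approximately a probability distribution; if you still have the model object itself (not just the .npy file), lda_model1.get_topics() should give you the same topics x words matrix with each row properly normalised.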

Source: https://groups.google.com/forum/?fromgroups=#!searchin/gensim/lda$20topic-word$20matrix/gensim/Qoj7Agkx3qE/r9lyfihC4b4J