2
votes

I created an LDA model for some text files using the gensim package in Python, and I want to get the topic distributions from the learned model. Is there a method in gensim's LdaModel class, or some other way, to get the topic distributions from the model? For example, I use the coherence model to find the model with the best coherence value for a number of topics in the range 1 to 5. After getting the best model, I use the get_document_topics method (thanks kenhbs) to get the topic distribution of the document that was used for creating the model.

id2word = corpora.Dictionary([doc_terms])  # doc_terms: the document's token list
bow = id2word.doc2bow(doc_terms)

max_coherence = -1
best_lda_model = None

for num_topics in range(1, 6):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=[bow], num_topics=num_topics, id2word=id2word)
    coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=[doc_terms], dictionary=id2word)
    coherence_value = coherence_model.get_coherence()
    if coherence_value > max_coherence:
        max_coherence = coherence_value
        best_lda_model = lda_model

The best model has 4 topics:

print(best_lda_model.num_topics)

4

But when I use get_document_topics, I get fewer than 4 values for the document's distribution.

topic_distrs = best_lda_model.get_document_topics(bow)

print(len(topic_distrs))

3

My question is: for the best LDA model with 4 topics (chosen via the coherence model), why does get_document_topics return fewer than 4 topics for the same document? And why do some topics have a very small probability (less than 1e-8)?


3 Answers

2
votes

From the documentation, you can use two methods for this.

If you are aiming to get the main terms in a specific topic, use get_topic_terms:

from gensim.models.ldamodel import LdaModel

K = 10
lda = LdaModel(some_corpus, num_topics=K)

lda.get_topic_terms(5, topn=10)
# Or for all topics
for i in range(K):
    lda.get_topic_terms(i, topn=10)

You can also print the entire underlying np.ndarray (called either beta or phi in standard LDA papers, dimensions are (K, V) or (V, K)).

phi = lda.get_topics()

edit: From the link I included in the original answer: if you are looking for a document's topic distribution, use

res = lda.get_document_topics(bow)

As can be read from the documentation, the resulting object contains the following three lists:

  • list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.

  • list of (int, list of (int, float)), optional – Most probable topics per word. Each element in the list is a pair of a word’s id, and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.

  • list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word’s id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.

Now,

tops, probs = zip(*res[0])

probs will contain up to K (for you, 4) probabilities; topics below minimum_probability are dropped by default, so pass minimum_probability=0 to get_document_topics if you want all of them. The returned probabilities sum to 1.

1
vote

You can play with the minimum_probability parameter and set it to a very small value such as 0.000001 (or simply 0.0):

topic_vector = [x[1] for x in ldamodel.get_document_topics(new_doc_bow, minimum_probability=0.0, per_word_topics=False)]
0
votes

Just type:

pd.DataFrame(lda_model.get_document_topics(doc_term_matrix))