2
votes

I created an LDA model for some text files using the gensim package in Python, and I want to get the topic distributions from the learned model. Is there a method in gensim's LdaModel class, or some other way, to get the topic distributions from the model? For example, I use the coherence model to find the model with the best coherence value for a number of topics in the range 1 to 5. After getting the best model, I use the get_document_topics method (thanks kenhbs) to get the topic distribution of the document that was used for creating the model.

id2word = corpora.Dictionary([doc_terms])  # doc_terms: the document's token list
bow = id2word.doc2bow(doc_terms)

max_coherence = -1
best_lda_model = None

for num_topics in range(1, 6):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=[bow], num_topics=num_topics, id2word=id2word)
    coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=[doc_terms], dictionary=id2word)
    coherence_value = coherence_model.get_coherence()
    if coherence_value > max_coherence:
        max_coherence = coherence_value
        best_lda_model = lda_model

The best model has 4 topics:

print(best_lda_model.num_topics)

4

But when I use get_document_topics, I get fewer than 4 values for the document's distribution.

topic_distrs = best_lda_model.get_document_topics(bow)

print(len(topic_distrs))

3

My question is: for the best LDA model with 4 topics (chosen via the coherence model), why does get_document_topics return fewer than 4 topics for the same document? And why do some topics have a very small probability (less than 1e-8)?


3 Answers

2
votes

From the documentation, you can use two methods for this.

If you are aiming to get the main terms in a specific topic, use get_topic_terms:

from gensim.models.ldamodel import LdaModel

K = 10
lda = LdaModel(some_corpus, num_topics=K)

lda.get_topic_terms(5, topn=10)
# Or for all topics
for i in range(K):
    lda.get_topic_terms(i, topn=10)

You can also print the entire underlying np.ndarray (called either beta or phi in standard LDA papers, dimensions are (K, V) or (V, K)).

phi = lda.get_topics()

edit: From the link I included in the original answer: if you are looking for a document's topic distribution, use

res = lda.get_document_topics(bow)

As can be read from the documentation, the resulting object contains the following three lists:

  • list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.

  • list of (int, list of (int, float)), optional – Most probable topics per word. Each element in the list is a pair of a word’s id, and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.

  • list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word’s id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.

Now,

tops, probs = zip(*res[0])

probs will contain up to K (for you, 4) probabilities; topics below minimum_probability are dropped by default, so pass minimum_probability=0 to get_document_topics if you want all of them. The returned probabilities sum to 1.

1
vote

You can play with the minimum_probability parameter and set it to a very small value such as 0.000001 (or simply 0.0):

topic_vector = [x[1] for x in ldamodel.get_document_topics(new_doc_bow, minimum_probability=0.0, per_word_topics=False)]
0
votes

Just type:

pd.DataFrame(lda_model.get_document_topics(doc_term_matrix))