I am currently going through the LDA (Latent Dirichlet Allocation) topic modelling method to extract topics from a set of documents. From what I have understood from the link below, this is an unsupervised learning approach for categorizing / labelling each document with the extracted topics.

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

In the sample code given in that link, there is a function defined to get the top words associated with each identified topic.

import sklearn
sklearn.__version__  # '0.17'

from sklearn.decomposition import LatentDirichletAllocation


def print_top_words(model, feature_names, n_top_words):
    # model.components_ has shape (n_topics, n_features) and gives the
    # word distribution for each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
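
For reference, lda, tf_vectorizer, tf and n_top_words above come from the linked example; a minimal, self-contained sketch of that setup (the corpus and parameter values below are placeholders for illustration, not the example's actual data) could look like:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus and sizes, just to make the snippet runnable
documents = ["the cat sat on the mat",
             "dogs and cats are pets",
             "stock markets fell sharply today"]
n_topics, n_top_words = 2, 5

# Term-frequency (bag-of-words) features for LDA
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)

# n_topics is the 0.17 parameter name; later releases rename it to n_components
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=10,
                                learning_method='online', random_state=0)
lda.fit(tf)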

My question is this: is there any component or matrix of the fitted LDA model from which we can get the document-topic association?

For example, I need to find the top 2 topics associated with each document, to serve as that document's label / category. Is there any component that gives the distribution of topics within a document, similar to the way model.components_ gives the distribution of words within a topic?


1 Answer

You can compute the document-topic association using the transform(X) method of the LatentDirichletAllocation class.

For the example code, this would be:

doc_topic_distrib = lda.transform(tf)

where lda is the fitted LDA model and tf is the document-term matrix (the input data) you want to transform. Each row of the returned matrix is the topic distribution for the corresponding document.
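
For the "top 2 topics per document" part of the question, a minimal sketch of how you could pick them out of doc_topic_distrib (NumPy usage assumed; depending on the scikit-learn version the rows may already be normalized):

import numpy as np

# doc_topic_distrib has shape (n_documents, n_topics).
# In some scikit-learn versions transform() already returns normalized rows;
# if not, divide each row by its sum to get proper probabilities.
doc_topic_distrib = doc_topic_distrib / doc_topic_distrib.sum(axis=1, keepdims=True)

# Indices of the 2 most probable topics for each document, most probable first
top2_topics = np.argsort(doc_topic_distrib, axis=1)[:, :-3:-1]

for doc_idx, topics in enumerate(top2_topics):
    print("Document #%d: topics %s" % (doc_idx, topics))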