0
votes

The LDA code generates topics, say from 0 to 5. Is there a standard way (a norm) to link the generated topics to the documents themselves? E.g. doc1 belongs to Topic0, doc5 belongs to Topic1, etc. One way I can think of is to string-search the documents for each of the generated keywords of each topic. Is there a generic way or common practice for this?

Example LDA code: https://github.com/manhcompany/lda/blob/master/lda.py


1 Answer

0
votes

I "collected some code", and this worked for me. Assuming you build a term-frequency matrix with scikit-learn's CountVectorizer and fit the LDA model on it:

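from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer()  # parameters of your choice
tf = tf_vectorizer.fit_transform(documents)  # documents: placeholder for your list of strings
lda_model = LatentDirichletAllocation()  # other parameters of your choice
lda_model.fit(tf)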

Then create the document-topic matrix (the crucial step) and, for each document, select the num_most_important_topic topics with the highest weight:

doc_topic = lda_model.transform(tf)
num_most_important_topic = 2

dominant_topic = []
for ind_doc in range(doc_topic.shape[0]):
    dominant_topic.append(sorted(range(len(doc_topic[ind_doc])),
                          key=lambda ind_top: doc_topic[ind_doc][ind_top],
                          reverse=True)[:num_most_important_topic])

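This should give you a list with one entry per document, each holding the indices of that document's num_most_important_topic highest-weight topics, strongest first. Set num_most_important_topic = 1 if you only want the single dominant topic per document. Good luck!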
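
To make the result concrete, here is a small sketch of the same ranking step on hand-made numbers. The weights below are invented for illustration and are not output from a real model, and the helper name top_topics is my own, not part of any library. Each row stands in for one row of the document-topic matrix.

```python
def top_topics(doc_topic, k=1):
    """For each document row, return the indices of its k highest-weight topics."""
    result = []
    for weights in doc_topic:
        # Rank topic indices by their weight in this document, strongest first.
        ranked = sorted(range(len(weights)), key=lambda t: weights[t], reverse=True)
        result.append(ranked[:k])
    return result


# Made-up document-topic weights: 3 documents, 3 topics (illustration only).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],
]

dominant = top_topics(doc_topic, k=1)   # one topic per document
top_two = top_topics(doc_topic, k=2)    # two topics per document
print(dominant)
print(top_two)
```

Here the first document is assigned topic 0, the second topic 2, and the third topic 1. The same function should also work on the array returned by lda_model.transform(tf), since it only iterates over rows and indexes into them.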