I am using gensim LDA for topic modelling. I need to get the topic distribution of a corpus, not of individual documents. Let's say I have 1000 documents that belong to 10 different categories (say 100 docs per category). After training the LDA model on all 1000 documents, I want to see which topics are dominant in each category. The following image illustrates my dataset and aim.

enter image description here

So far I can think of two approaches, but I am not sure either is sound. I would be happy to hear if there is a better way of doing it.

In the first approach, I can concatenate the documents of each category into one large document. There will then be only 10 large documents, and for each of them I can retrieve its topic distribution.

Another approach might be to get the topic distribution of every document, without concatenating them. For each category we would then have 100 per-document topic distributions. To get the dominant topics of a category, I could sum the probabilities of each topic over its documents and keep only the few highest-scoring topics. I am not sure either of these approaches is right; what would you suggest?

1 Answer

In approach 1), you are concatenating documents (of possibly different lengths) and getting the topics of one big document, so the importance of shorter documents is likely to be diminished.

In approach 2), documents of all lengths get almost equal importance (depending on how you combine the topic distributions).
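To make the "depending on how you combine" part concrete, here is a minimal sketch of approach 2 in plain Python. It assumes each document's topic distribution is already available as a list of `(topic_id, probability)` pairs, which is the shape gensim's `LdaModel.get_document_topics` returns. The helper names (`category_topic_scores`, `top_topics`) and the toy numbers are made up for illustration. Passing token counts as weights is only a rough stand-in for the effect of concatenation in approach 1, not an exact equivalent.

```python
from collections import defaultdict


def category_topic_scores(doc_topics, categories, weights=None):
    """Average per-document topic distributions within each category.

    doc_topics: one list of (topic_id, probability) pairs per document
    categories: one category label per document
    weights:    optional per-document weights (e.g. token counts);
                defaults to equal weight for every document
    """
    if weights is None:
        weights = [1.0] * len(doc_topics)

    totals = defaultdict(lambda: defaultdict(float))
    weight_sums = defaultdict(float)
    for topics, cat, w in zip(doc_topics, categories, weights):
        weight_sums[cat] += w
        for topic_id, prob in topics:
            totals[cat][topic_id] += w * prob

    # Normalise so each category's scores are a weighted mean.
    return {
        cat: {t: s / weight_sums[cat] for t, s in topic_sums.items()}
        for cat, topic_sums in totals.items()
    }


def top_topics(scores, n=3):
    """Return the n highest-scoring (topic_id, score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy data (hypothetical): two topics, three documents, two categories.
doc_topics = [
    [(0, 0.9), (1, 0.1)],  # short "sports" document
    [(0, 0.2), (1, 0.8)],  # long "sports" document
    [(0, 0.1), (1, 0.9)],  # "politics" document
]
categories = ["sports", "sports", "politics"]
doc_lengths = [10, 90, 50]  # token counts, also made up

equal_weight = category_topic_scores(doc_topics, categories)
length_weight = category_topic_scores(doc_topics, categories, doc_lengths)

print(top_topics(equal_weight["sports"], n=1))
print(top_topics(length_weight["sports"], n=1))
```

With equal weights, topic 0 comes out on top for "sports"; with length weights the long document dominates and topic 1 wins instead. With a real model you would probably want to call `get_document_topics` with `minimum_probability=0` so that low-probability topics are not dropped before averaging.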

Which approach you should go with will depend on your use case.