LDA Mallet alternative for get_document_topics - Measuring topics per document

Question

I'm currently doing an LDA analysis using Python and the Gensim Mallet wrapper. After training the model and getting the topics, I want to see how the topics are distributed over the individual documents. With the regular Gensim LDA model, I could call get_document_topics for every document in my corpus. However, the Mallet wrapper does not have this function. I can retrieve the topic distribution for one specific document, but I can't find a way to collect and store it for every document (for instance in a list or dataframe).

I can use the following code to get the topic distribution for a single document:

print(ldamallet[mm[6000]])

which returns the following output:

[(0, 0.3055555555555555), (1, 0.3253968253968254), (2, 0.36904761904761907)]

However, I can't get it to iterate over the roughly 9,000 documents in my dataset.

Additional code that could be relevant:

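import gensim
from gensim import corpora

# wordsFiltered is a list of tokenized documents (one list of tokens per document)
id2word = corpora.Dictionary(wordsFiltered)
id2word.filter_extremes(no_below=167, keep_tokens=None)
# Bag-of-words corpus: one list of (token_id, count) pairs per document
mm = [id2word.doc2bow(wordsFilter) for wordsFilter in wordsFiltered]
mallet_path = 'path'
# gensim.models.wrappers is part of Gensim 3.x; it was removed in Gensim 4.0
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=mm, num_topics=3, id2word=id2word)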

Does anyone have suggestions? Thanks in advance!

Principia Principia · Accepted Answer · 2020-02-26T06:37:00

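I managed to find a rather simple solution. Indexing the model with the whole corpus yields one topic distribution per document, so the following code gave me a list of lists with the topic proportions for every document.

topics_docs = []  # one list of (topic_id, proportion) pairs per document
for m in ldamallet[mm]:
    topics_docs.append(m)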

If anybody has suggestions to make it cleaner or has another approach, feel free to share. I'm still a beginner, so all advice is welcome.
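One possible next step, as a minimal sketch rather than a tested recipe: reshape the collected per-document results into a plain documents-by-topics table, which is easier to store or load into a dataframe. The helper name to_dense_rows and the sample data are made up for illustration, and the sketch assumes each document's result is a list of (topic_id, proportion) pairs like the output shown in the question.

```python
def to_dense_rows(doc_topics, num_topics):
    """Convert per-document (topic_id, proportion) lists into fixed-width rows."""
    rows = []
    for pairs in doc_topics:
        # Start from zeros so a topic missing from a document still gets a column
        row = [0.0] * num_topics
        for topic_id, proportion in pairs:
            row[topic_id] = proportion
        rows.append(row)
    return rows


# Stand-in for the list collected from ldamallet[mm] (hypothetical values)
topics_docs = [
    [(0, 0.3055555555555555), (1, 0.3253968253968254), (2, 0.36904761904761907)],
    [(0, 0.5), (1, 0.25), (2, 0.25)],
]

rows = to_dense_rows(topics_docs, num_topics=3)

# Index of the highest-weighted topic for each document
dominant = [row.index(max(row)) for row in rows]
```

If pandas is installed, passing the result to pandas.DataFrame(rows, columns=["topic_0", "topic_1", "topic_2"]) should give one row per document and one column per topic.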
