I would like to know how to access the dictionary from a gensim LDA topic model. This is particularly important when you train an LDA model, save it, and load it later. In other words, suppose lda_model is a model trained on a collection of documents. To get the document-topic matrix, one can do something like the following, or something like what is explained in https://www.kdnuggets.com/2019/09/overview-topics-extraction-python-latent-dirichlet-allocation.html:
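import re
from gensim.corpora.dictionary import Dictionary

def regTokenize(text):
    # tokenize the text into words
    WORD = re.compile(r'\w+')
    words = WORD.findall(text)
    return words

# text is a list of raw document strings; lda_model is the trained model
ttext = [regTokenize(d) for d in text]
# build a new dictionary from the new documents
dic = Dictionary(ttext)
# convert each tokenized document to a bag-of-words vector
ttext = [dic.doc2bow(tokens) for tokens in ttext]
ttext = lda_model.get_document_topics(ttext)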
However, the dictionary in the trained lda_model might differ from the one built from the new data, which causes an error on the last line, such as:
"IndexError: index 41021 is out of bounds for axis 1 with size 41021"
Is there any way (or parameter) to obtain the dictionary from the trained lda_model, so that I can use it instead of dic = Dictionary(ttext)? Your help is much appreciated!