6
votes

I created a Gensim LDA Model as shown in this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

lda_model = gensim.models.LdaMulticore(data_df['bow_corpus'], num_topics=10, id2word=dictionary, random_state=100, chunksize=100, passes=10, per_word_topics=True)

And it generates 10 topics with a log_perplexity of:

lda_model.log_perplexity(data_df['bow_corpus']) = -5.325966117835991

But when I run the coherence model on it to calculate the coherence score, like so:

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_df['bow_corpus'].tolist(), dictionary=dictionary, coherence='c_v')
with np.errstate(invalid='ignore'):
    lda_score = coherence_model_lda.get_coherence()

My LDA-Score is nan. What am I doing wrong here?

2 Answers

8
votes

Solved! CoherenceModel requires the original texts for its texts argument, not the bag-of-words training corpus fed to the LDA model. When I ran this:

coherence_model_lda = CoherenceModel(model=lda_model, texts=data_df['corpus'].tolist(), dictionary=dictionary, coherence='c_v')
with np.errstate(invalid='ignore'):
    lda_score = coherence_model_lda.get_coherence()

I got a coherence score of 0.462.

Hope this helps someone else making the same mistake. Thanks!
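For anyone who wants to catch this before calling get_coherence(), here is a minimal sketch of a shape check. The helper name classify_texts_arg and the sample data are made up for illustration and are not part of gensim. It only assumes that a bag-of-words corpus is a list of documents made of (token_id, count) pairs, while tokenized texts are a list of lists of str.

```python
def classify_texts_arg(docs):
    """Hypothetical helper (not a gensim function): guess what was passed as `texts`."""
    docs = list(docs)
    if not docs:
        return "empty"
    # Whole documents given as plain strings, i.e. not tokenized at all.
    if any(isinstance(doc, str) for doc in docs):
        return "raw strings" if all(isinstance(doc, str) for doc in docs) else "unknown"
    items = [item for doc in docs for item in doc]
    # Tokenized texts: every document is a sequence of str tokens.
    if items and all(isinstance(item, str) for item in items):
        return "tokenized texts"
    # Bag-of-words: every document is a sequence of (token_id, count) pairs.
    if items and all(isinstance(item, tuple) and len(item) == 2 for item in items):
        return "bag-of-words corpus"
    return "unknown"


# Illustrative data only; the column names in the question are not reproduced here.
tokenized = [["human", "computer", "interface"], ["graph", "trees", "minors"]]
bow = [[(0, 1), (1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)]]
raw = ["human computer interface", "graph trees minors"]

print(classify_texts_arg(tokenized))  # tokenized texts
print(classify_texts_arg(bow))        # bag-of-words corpus
print(classify_texts_arg(raw))        # raw strings
```

Only the first shape is what the c_v measure expects for texts. The other two are the kind of input that can end in nan.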

0
votes

The documentation (https://radimrehurek.com/gensim/models/coherencemodel.html) says to provide "Tokenized texts" (list of list of str) for the texts argument. These should be your texts split into individual words that appear in the dictionary you pass to CoherenceModel. If you provide full texts that are not tokenized, the lookup dictionary has no matching entries for the words.
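To illustrate the lookup problem, here is a rough sketch in plain Python. The names tokenize and vocab_coverage are hypothetical, the regex tokenizer is deliberately simplistic, and the vocab set is a stand-in for the keys of your dictionary's token2id mapping. It shows that untokenized strings match nothing in the vocabulary, because iterating over a string yields single characters rather than words.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a-z (a deliberately simple tokenizer)."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def vocab_coverage(texts, vocab):
    """Fraction of the items in `texts` that are found in `vocab`."""
    tokens = [tok for doc in texts for tok in doc]
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


# Stand-in for the dictionary's vocabulary (an assumption for this sketch).
vocab = {"human", "computer", "interface", "graph", "trees"}

raw_docs = ["Human computer interface", "Graph of trees"]
tokenized_docs = [tokenize(doc) for doc in raw_docs]

print(tokenized_docs)                         # list of list of str
print(vocab_coverage(tokenized_docs, vocab))  # most tokens are in the vocabulary
print(vocab_coverage(raw_docs, vocab))        # 0.0: single characters are not in it
```

If the coverage of your texts against the dictionary is at or near zero, the texts are probably not in the tokenized form the coherence model expects.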