LDA Topic Model Performance - Topic Coherence Implementation for scikit-learn

Question

I have a question about measuring/calculating topic coherence for LDA models built in scikit-learn.

Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows Topic Coherence to be calculated for a given LDA model (several variants are included).

I would like to use scikit-learn's LDA rather than gensim's LDA, for its ease of use and documentation. (Note: I want to avoid the gensim to scikit-learn wrapper, i.e. I want to actually use sklearn’s LDA.) From my research, there seems to be no scikit-learn equivalent of Gensim’s CoherenceModel.

Is there a way to either:

1 - Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline, either through manually converting the scikit-learn model into gensim format or through a scikit-learn to gensim wrapper (I have seen the wrapper the other way around) to generate Topic Coherence?

Or

2 - Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?

I have done quite a bit of research on implementations for this use case online but haven’t seen any solutions. The only leads I have are the documented equations from scientific literature.

If anyone has any knowledge on any similar implementations, or if you could point me in the right direction for creating a manual method for this, that would be great. Thank you!

*Side note: I understand that perplexity and log-likelihood are available in scikit-learn as performance measures, but from what I have read these are not as predictive of human interpretability.

I don't have any direct answer to your question. However, why don't you just use gensim to fit a new LDA model? I have no experience with LDA in scikit-learn, but I do know that gensim is blazing fast and nice to use — KenHBS
Honestly, Gensim's own LDA is not so fast when the corpus gets really large and you ask for 50+ topics. It also tends to give poorer results than, e.g., MALLET. I have the same question as the author. — Tolga

jhl · Accepted Answer · 2019-04-14T15:45:50

Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline

As far as I know, there is no "easy way" to do this. You would have to manually reformat the sklearn data structures to be compatible with gensim. I haven't attempted this myself, but it strikes me as an unnecessary step that might take a long time. There is an old Python 2.7 attempt at a gensim-sklearn wrapper that you might want to look at. It seems to be deprecated, but maybe you can get some information or inspiration from it.
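One route that may sidestep converting the model itself: gensim's CoherenceModel accepts a `topics` argument (a list of word lists) together with tokenized `texts` and a `Dictionary`, so only the top words per topic need to come from sklearn. The sketch below assumes that approach; the helper names `top_words_per_topic` and `gensim_coherence` are made up for illustration, and I have not benchmarked this on a large corpus.

```python
import numpy as np


def top_words_per_topic(components, feature_names, top_n=10):
    """Return the top_n highest-weighted words for each topic (row)."""
    components = np.asarray(components)
    feature_names = np.asarray(feature_names)
    # Sort each row by descending weight and keep the first top_n indices.
    order = np.argsort(components, axis=1)[:, ::-1][:, :top_n]
    return [feature_names[row].tolist() for row in order]


def gensim_coherence(topics, tokenized_texts, coherence="c_v"):
    """Score topic word lists with gensim (gensim must be installed)."""
    # Imported lazily so the helper above works without gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel

    # Every topic word must appear in the dictionary, so the texts should be
    # tokenized the same way the sklearn vectorizer tokenized them.
    dictionary = Dictionary(tokenized_texts)
    model = CoherenceModel(
        topics=topics,
        texts=tokenized_texts,
        dictionary=dictionary,
        coherence=coherence,
    )
    return model.get_coherence()
```

With a fitted sklearn model, `components` would be `lda.components_`, `feature_names` would be `vectorizer.get_feature_names_out()`, and the tokenized texts could be built with `analyzer = vectorizer.build_analyzer()` followed by `[analyzer(doc) for doc in docs]`, so that the tokens match the vectorizer's vocabulary.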

Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?

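Yes, this is possible. Most coherence measures come down to summing a score over pairs of a topic's top words, where the score is computed from document occurrence and co-occurrence counts, and that can be done with a simple loop. You can find code samples for a "manual" coherence calculation for NMF. The calculation depends on the specific measure, of course, but sklearn should give you the data you need for the analysis pretty easily.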
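To make the loop concrete, here is a minimal sketch of one such measure, assuming the UMass coherence of Mimno et al. (2011): for a topic's top words w_1..w_N, ordered by descending weight, sum log((D(w_i, w_j) + 1) / D(w_j)) over all pairs with j < i, where D counts the documents containing the word(s). The function name `umass_coherence` and its parameters are my own; other measures (UCI, c_v) need different counts.

```python
import numpy as np


def umass_coherence(topic_word, doc_term, top_n=10, eps=1.0):
    """UMass-style coherence for each topic; higher (closer to 0) is better.

    topic_word: (n_topics, n_terms) weights, e.g. lda.components_
    doc_term:   (n_docs, n_terms) counts, e.g. CountVectorizer output
    Assumes every top word occurs in at least one document, which holds
    when the vocabulary was fitted on the same corpus.
    """
    if hasattr(doc_term, "toarray"):  # accept scipy sparse matrices
        doc_term = doc_term.toarray()
    # Binary presence matrix: 1.0 if the term occurs in the document.
    present = (np.asarray(doc_term) > 0).astype(np.float64)
    topic_word = np.asarray(topic_word)

    scores = []
    for weights in topic_word:
        top = np.argsort(weights)[::-1][:top_n]  # top words, best first
        sub = present[:, top]
        # co_doc[i, j] = number of documents containing both word i and word j;
        # the diagonal holds each word's document frequency.
        co_doc = sub.T @ sub
        doc_freq = np.diag(co_doc)
        score = 0.0
        for i in range(1, len(top)):
            for j in range(i):  # j is the higher-ranked word of the pair
                score += np.log((co_doc[i, j] + eps) / doc_freq[j])
        scores.append(score)
    return np.array(scores)
```

Calling `umass_coherence(lda.components_, X)` with `X` the matrix returned by `CountVectorizer.fit_transform` gives one score per topic. Note that gensim's own `u_mass` option may aggregate the pair scores differently, so the numbers are not necessarily comparable across the two implementations.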

Resources

It is unclear to me why you would categorically exclude gensim - the topic coherence pipeline is pretty extensive, and documentation exists.

See, for example, these three tutorials (in Jupyter notebooks).

The formulas for several coherence measures can be found in this paper here.
