Latent Dirichlet Allocation (LDA), a popular topic model, returns different topics, with different probability distributions over the dictionary words, each time it is used to extract topics from the same corpus.
Latent Semantic Indexing (LSI), by contrast, gives the same topics and the same distributions on every run.
Yet in practice LDA is widely used to extract topics. How does LDA maintain consistency if it returns a different topic distribution every time it is run?
Consider this simple example. A sample of documents is taken, where D denotes a document:
D1: Linear Algebra techniques for dimensionality reduction
D2: dimensionality reduction of a sample database
D3: An introduction to linear algebra
D4: Measure of similarity and dissimilarity of different web documents
D5: Classification of data using database sample
D6: overfitting due lack of representative samples
D7: handling overfitting in descision tree
D8: proximity measure for web documents
D9: introduction to web query classification
D10: classification using LSI
Each line represents a document. The LDA model is used to generate topics from the above corpus. Gensim is used for LDA: batch LDA is performed with the number of topics set to 4 and the number of passes set to 20.
Batch LDA is first performed on the original corpus, and the topics generated after 20 passes are:
topic #0: 0.045*query + 0.043*introduction + 0.042*similarity + 0.042*different + 0.041*reduction + 0.040*handling + 0.039*techniques + 0.039*dimensionality + 0.039*web + 0.039*using
topic #1: 0.043*tree + 0.042*lack + 0.041*reduction + 0.040*measure + 0.040*descision + 0.039*documents + 0.039*overfitting + 0.038*algebra + 0.038*proximity + 0.038*query
topic #2: 0.043*reduction + 0.043*data + 0.042*proximity + 0.041*linear + 0.040*database + 0.040*samples + 0.040*overfitting + 0.039*lsi + 0.039*introduction + 0.039*using
topic #3: 0.046*lsi + 0.045*query + 0.043*samples + 0.040*linear + 0.040*similarity + 0.039*classification + 0.039*algebra + 0.039*documents + 0.038*handling + 0.037*sample
Batch LDA is then performed again on the same original corpus, and the topics generated this time are:
topic #0: 0.041*data + 0.041*descision + 0.041*linear + 0.041*techniques + 0.040*dimensionality + 0.040*dissimilarity + 0.040*database + 0.040*reduction + 0.039*documents + 0.038*proximity
topic #1: 0.042*dissimilarity + 0.041*documents + 0.041*dimensionality + 0.040*tree + 0.040*proximity + 0.040*different + 0.038*descision + 0.038*algebra + 0.038*similarity + 0.038*techniques
topic #2: 0.043*proximity + 0.042*data + 0.041*database + 0.041*different + 0.041*tree + 0.040*techniques + 0.040*linear + 0.039*classification + 0.038*measure + 0.038*representative
topic #3: 0.043*similarity + 0.042*documents + 0.041*algebra + 0.041*web + 0.040*proximity + 0.040*handling + 0.039*dissimilarity + 0.038*representative + 0.038*tree + 0.038*measure
The word distribution in each topic is not the same in the two cases. In fact, the word distribution is never the same.
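One rough way to quantify how far apart the two runs are is to compare the top-10 word sets of their topics. The sketch below uses only the topic strings listed above; the helper names `top_words` and `jaccard` are made up for illustration, and Jaccard overlap of top words is just one possible measure, chosen here because topic numbering is arbitrary and topics from different runs have to be matched by content.

```python
import re

run_1 = [
    "0.045*query + 0.043*introduction + 0.042*similarity + 0.042*different + 0.041*reduction + 0.040*handling + 0.039*techniques + 0.039*dimensionality + 0.039*web + 0.039*using",
    "0.043*tree + 0.042*lack + 0.041*reduction + 0.040*measure + 0.040*descision + 0.039*documents + 0.039*overfitting + 0.038*algebra + 0.038*proximity + 0.038*query",
    "0.043*reduction + 0.043*data + 0.042*proximity + 0.041*linear + 0.040*database + 0.040*samples + 0.040*overfitting + 0.039*lsi + 0.039*introduction + 0.039*using",
    "0.046*lsi + 0.045*query + 0.043*samples + 0.040*linear + 0.040*similarity + 0.039*classification + 0.039*algebra + 0.039*documents + 0.038*handling + 0.037*sample",
]
run_2 = [
    "0.041*data + 0.041*descision + 0.041*linear + 0.041*techniques + 0.040*dimensionality + 0.040*dissimilarity + 0.040*database + 0.040*reduction + 0.039*documents + 0.038*proximity",
    "0.042*dissimilarity + 0.041*documents + 0.041*dimensionality + 0.040*tree + 0.040*proximity + 0.040*different + 0.038*descision + 0.038*algebra + 0.038*similarity + 0.038*techniques",
    "0.043*proximity + 0.042*data + 0.041*database + 0.041*different + 0.041*tree + 0.040*techniques + 0.040*linear + 0.039*classification + 0.038*measure + 0.038*representative",
    "0.043*similarity + 0.042*documents + 0.041*algebra + 0.041*web + 0.040*proximity + 0.040*handling + 0.039*dissimilarity + 0.038*representative + 0.038*tree + 0.038*measure",
]


def top_words(topic):
    """Return the set of words in a 'prob*word + prob*word' topic string."""
    return set(re.findall(r"\*(\w+)", topic))


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


# For each topic of run 1, find the most similar topic of run 2.
for i, topic in enumerate(run_1):
    words = top_words(topic)
    scores = [jaccard(words, top_words(other)) for other in run_2]
    best = max(range(len(scores)), key=scores.__getitem__)
    print(f"run 1 topic #{i}: closest run 2 topic #{best}, "
          f"Jaccard {scores[best]:.2f}")
```

A score of 1.0 would mean two topics share exactly the same top-10 words; lower scores mean less overlap.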
So how does LDA work effectively if, unlike LSI, it doesn't produce the same word distribution in its topics on every run?