1 vote

The popular topic model Latent Dirichlet Allocation (LDA), when used to extract topics from a corpus, returns different topics with different probability distributions over the dictionary words on each run.

Latent Semantic Indexing (LSI), in contrast, gives the same topics and the same distributions every time.

In practice, LDA is widely used to extract topics. How does LDA maintain consistency if it returns a different topic distribution every time it is run?

Consider this simple example. A sample of documents is taken, where D represents a document:

D1: Linear Algebra techniques for dimensionality reduction
D2: dimensionality reduction of a sample database
D3: An introduction to linear algebra
D4: Measure of similarity and dissimilarity of different web documents
D5: Classification of data using database sample
D6: overfitting due lack of representative samples
D7: handling overfitting in descision tree
D8: proximity measure for web documents
D9: introduction to web query classification
D10: classification using LSI 

Each line represents a document. The LDA model is used to generate topics from the corpus above; Gensim is used for the LDA implementation. Batch LDA is performed, with the number of topics set to 4 and the number of passes set to 20, along the lines of the sketch below.
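
A minimal sketch of such a run in gensim (the variable names are mine, and I'm assuming simple lowercase/whitespace tokenization of the ten documents above):

    from gensim import corpora
    from gensim.models.ldamodel import LdaModel

    # Documents copied verbatim from the question (typos included)
    documents = [
        "Linear Algebra techniques for dimensionality reduction",
        "dimensionality reduction of a sample database",
        "An introduction to linear algebra",
        "Measure of similarity and dissimilarity of different web documents",
        "Classification of data using database sample",
        "overfitting due lack of representative samples",
        "handling overfitting in descision tree",
        "proximity measure for web documents",
        "introduction to web query classification",
        "classification using LSI",
    ]
    texts = [doc.lower().split() for doc in documents]

    # Build the dictionary and the bag-of-words corpus
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Batch LDA: update_every=0 disables online updating; 4 topics, 20 passes
    lda = LdaModel(corpus, num_topics=4, id2word=dictionary,
                   passes=20, update_every=0)
    for topic in lda.print_topics(num_topics=4, num_words=10):
        print(topic)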

Batch LDA is first performed on the original corpus, and the topics generated after 20 passes are:

topic #0: 0.045*query + 0.043*introduction + 0.042*similarity + 0.042*different + 0.041*reduction + 0.040*handling + 0.039*techniques + 0.039*dimensionality + 0.039*web + 0.039*using

topic #1: 0.043*tree + 0.042*lack + 0.041*reduction + 0.040*measure + 0.040*descision + 0.039*documents + 0.039*overfitting + 0.038*algebra + 0.038*proximity + 0.038*query

topic #2: 0.043*reduction + 0.043*data + 0.042*proximity + 0.041*linear + 0.040*database + 0.040*samples + 0.040*overfitting + 0.039*lsi + 0.039*introduction + 0.039*using

topic #3: 0.046*lsi + 0.045*query + 0.043*samples + 0.040*linear + 0.040*similarity + 0.039*classification + 0.039*algebra + 0.039*documents + 0.038*handling + 0.037*sample

Batch LDA is then performed again on the same original corpus, and the topics generated that time are:

topic #0: 0.041*data + 0.041*descision + 0.041*linear + 0.041*techniques + 0.040*dimensionality + 0.040*dissimilarity + 0.040*database + 0.040*reduction + 0.039*documents + 0.038*proximity

topic #1: 0.042*dissimilarity + 0.041*documents + 0.041*dimensionality + 0.040*tree + 0.040*proximity + 0.040*different + 0.038*descision + 0.038*algebra + 0.038*similarity + 0.038*techniques

topic #2: 0.043*proximity + 0.042*data + 0.041*database + 0.041*different + 0.041*tree + 0.040*techniques + 0.040*linear + 0.039*classification + 0.038*measure + 0.038*representative

topic #3: 0.043*similarity + 0.042*documents + 0.041*algebra + 0.041*web + 0.040*proximity + 0.040*handling + 0.039*dissimilarity + 0.038*representative + 0.038*tree + 0.038*measure

The word distributions in the topics are not the same in the two cases. In fact, the word distributions are never the same across runs.

So how does LDA work effectively if it doesn't produce the same word distribution in its topics the way LSI does?

I'm not sure I understand the problem. Are you worried that two runs of an LDA training algorithm might return different models? – Fred Foo
@larsmans I added some more information to make my point clear. Hope it is clear. – Kai

4 Answers

4 votes

I think there are two issues here. First, LDA training is not deterministic the way LSI is; the common training algorithms for LDA are sampling methods. If the results of multiple training runs are wildly different, that's either a bug, the wrong settings, or plain bad luck. You can try multiple runs of LDA training if you're trying to optimize some objective function, as in the sketch below.
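
For instance, a sketch of that last idea, assuming the corpus and dictionary from the question and using gensim's log_perplexity (a per-word likelihood bound) as the objective:

    from gensim.models.ldamodel import LdaModel

    # Train several models and keep the one with the best per-word
    # likelihood bound on the training corpus (higher is better)
    models = [LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)
              for _ in range(5)]
    best = max(models, key=lambda m: m.log_perplexity(corpus))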

Then as for clustering, querying and classification: once you have a trained LDA model, you can apply that model to other documents in a deterministic way. Different LDA models will give you different results, but from one LDA model that you've labeled as the final model, you'll always get the same result.
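
A sketch of that second point, assuming a trained model lda and its dictionary as above:

    # Applying one fixed, final model to a document is repeatable
    new_doc = "dimensionality reduction for web documents"
    bow = dictionary.doc2bow(new_doc.lower().split())
    print(lda[bow])   # topic mixture, e.g. [(2, 0.87), ...] (values hypothetical)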

0 votes

LDA uses randomness in both the training and inference steps, so it will generate different topics every time. See this link: LDA model generates different topics everytime i train on the same corpus
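
A quick sketch of how to see this, reusing the corpus and dictionary built from the question's documents:

    from gensim.models.ldamodel import LdaModel

    lda_a = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)
    lda_b = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=20)

    # With no fixed seed, the two runs generally disagree
    print(lda_a.print_topics() == lda_b.print_topics())  # usually False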

0 votes

There are three solutions to this problem:

  1. set a random seed, e.g. random_seed = 123
  2. pickle - you can save your trained model to a file and load it again whenever you like, without the topics changing. You can even transfer the file to another machine and use the model there. Below, we create a file name for the pre-trained model, open the file, dump the model as a pickle, and close the pickle instance; then we load the saved LDA Mallet wrapper back from the pickle:

    import pickle

    # Save the trained LDA Mallet model to a file
    LDAMallet_file = 'Your Model'
    LDAMallet_pkl = open(LDAMallet_file, 'wb')
    pickle.dump(ldamallet, LDAMallet_pkl)
    LDAMallet_pkl.close()

    # Load the saved LDA Mallet model back from the file
    LDAMallet_pkl = open(LDAMallet_file, 'rb')
    ldamallet = pickle.load(LDAMallet_pkl)

    print("Loaded LDA Mallet wrap --", ldamallet)
    

    Check out the documentation: https://docs.python.org/3/library/pickle.html

    Get it? pickle because it preserves ;)

  3. joblib - same as pickle, but better with large numpy arrays; see the sketch below
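
    A minimal sketch of the joblib variant (the file name is hypothetical, and ldamallet is the trained model from above):

        from joblib import dump, load

        # Same idea as pickle, but handles large numpy arrays more efficiently
        dump(ldamallet, 'lda_mallet.joblib')      # save the trained model
        ldamallet = load('lda_mallet.joblib')     # load it back later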

I hope this helps :)

0 votes

I am not entirely sure I understand the problem, but to make it precise: you are saying that LDA produces a different topic distribution on each run over the same set of data.

First, LDA uses randomness to obtain those probability distributions, so for each run you will get different topic weights and words, but you can control this randomness.

import gensim

# Fixing random_state makes the training run reproducible
lda = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=number_of_topics, id2word=dictionary, passes=15, random_state=1)

Notice the use of random_state: if you fix this number, you can easily reproduce the output.
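
For example, two models trained with the same random_state on the same corpus should come out identical:

lda1 = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=number_of_topics, id2word=dictionary, passes=15, random_state=1)
lda2 = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=number_of_topics, id2word=dictionary, passes=15, random_state=1)

print(lda1.print_topics() == lda2.print_topics())  # True: same seed, same topics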