I know that the creation of LDA models is probabilistic, and that two models trained under the same parameters on the same corpus will not necessarily be identical. However, I'm wondering if the topic distribution of a document fed into an LDA model is also probabilistic.
I have an LDA model as presented here:
lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=numTopics, passes=10)
as well as two documents, Doc1 and Doc2. I want to find the cosine similarity of the two documents in lda space, so that:
x = cossim(lda[Doc1], lda[Doc2])
The problem I'm noticing is that when I run this over multiple iterations, the cosine similarity is not always identical, even when I use the same saved LDA model. The values are very close, but they differ slightly on each run. In my actual code I have hundreds of documents, so I convert the topic distributions to dense vectors and use NumPy to do the calculations as a matrix:
documentsList = np.array(documentsList)
calcMatrix = 1 - cdist(documentsList, documentsList, metric=self.metric)
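For context, the conversion step looks roughly like this; the `topics_to_dense` helper and the toy topic distributions below are made up for illustration, not my actual code:

```python
import numpy as np
from scipy.spatial.distance import cdist

def topics_to_dense(topic_dists, num_topics):
    """Convert gensim-style sparse (topic_id, prob) lists to a dense matrix."""
    dense = np.zeros((len(topic_dists), num_topics))
    for row, dist in enumerate(topic_dists):
        for topic_id, prob in dist:
            dense[row, topic_id] = prob
    return dense

# Hypothetical topic distributions for three documents
docs = [
    [(0, 0.7), (2, 0.3)],
    [(0, 0.7), (2, 0.3)],
    [(1, 1.0)],
]
dense = topics_to_dense(docs, num_topics=3)
# 1 - cosine distance = cosine similarity
sims = 1 - cdist(dense, dense, metric='cosine')
```

With this toy input, the two identical distributions get similarity 1.0 and the disjoint pair gets 0.0, so the dense conversion itself is deterministic; any run-to-run variation has to come from earlier in the pipeline.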
Am I running into a rounding error with NumPy (or another bug in my code), or is this behavior I should expect when using LDA to infer the topic distribution of a document?
Edit: I'm going to run a simple cosine similarity on 2 different documents using my lda model, and plot the spread of results. I will report back with what I find.
OK, here are the results of running cosine similarity against 2 documents, using the same LDA model.
Here is my code:
import gensim
from gensim import matutils
import matplotlib.pyplot as plt
import seaborn as sns

def testSpacesTwoDocs(doc1, doc2, dictionary):
    simList = []
    lda = gensim.models.ldamodel.LdaModel.load('LDA_Models/lda_bow_behavior_allFields_t385_p10')
    # doc2bow is deterministic, so the bow vectors are identical on every pass
    doc1bow = dictionary.doc2bow(doc1)
    doc2bow = dictionary.doc2bow(doc2)
    for i in range(50):
        vec1 = lda[doc1bow]
        vec2 = lda[doc2bow]
        S = matutils.cossim(vec1, vec2)
        simList.append(S)
    for entry in simList:
        print(entry)
    sns.set_style("darkgrid")
    plt.plot(simList, 'bs--')
    plt.show()
    return
Here are my results: 0.0082616863035, 0.00828413767524, 0.00826550453411, 0.00816756826185, 0.00829832701338, 0.00828970584276, 0.00828578705814, 0.00817109902484, 0.00817138247141, 0.00825297374028, 0.008269435921, 0.00826470121538, 0.00818282042634, 0.00824660449673, 0.00818087532906, 0.0081770261766, 0.00817128310123, 0.00817643202588, 0.00827404791376, 0.00832439428054, 0.00816643128216, 0.00828540881955, 0.00825746652101, 0.00816793513824, 0.00828471827526, 0.00827161219003, 0.00817773114553, 0.00826166001503, 0.00828048713541, 0.00817435544365, 0.0082956702812, 0.00826167470288, 0.00829873425476, 0.00825744872634, 0.00826802120149, 0.00829604894909, 0.0081776752236, 0.00817613482849, 0.00825839326441, 0.00817530362838, 0.0081747561999, 0.0082597447174, 0.00828958180101, 0.00827157760835, 0.00826939127657, 0.00826138381094, 0.00817755590806, 0.00827135780051, 0.00827314260067, 0.00817035250043
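If the variation really does come from stochastic inference, one workaround I'm considering is averaging several inference runs per document before comparing. A rough sketch of what I have in mind; the `mean_topic_vector` helper is hypothetical, and the stub model only stands in for a real gensim model so the snippet is self-contained:

```python
import numpy as np

def mean_topic_vector(lda, bow, num_topics, n_samples=10):
    """Average several (possibly stochastic) lda[bow] inference runs
    into one dense, more stable topic vector."""
    acc = np.zeros(num_topics)
    for _ in range(n_samples):
        for topic_id, prob in lda[bow]:  # gensim-style sparse (id, prob) pairs
            acc[topic_id] += prob
    return acc / n_samples

# Stand-in for a real model: always returns the same sparse distribution
class StubModel:
    def __getitem__(self, bow):
        return [(0, 0.6), (2, 0.4)]

vec = mean_topic_vector(StubModel(), [], num_topics=3)
```

The idea is that averaging should shrink the run-to-run jitter in each document's vector before the cosine similarities are computed.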
Am I correct to assume that the LDA model is inferring the topic distribution of both documents at each iteration, and thus that the cosine similarities are stochastic rather than deterministic? Is this much variation a sign that I'm not training my model long enough? Or am I not properly normalizing the vectors? Thanks