I know that the creation of LDA models is probabilistic, and that two models trained under the same parameters on the same corpus will not necessarily be identical. However, I'm wondering if the topic distribution of a document fed into an LDA model is also probabilistic.
I have an LDA model as presented here:
lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=numTopics, passes=10)
as well as two documents, Doc1 and Doc2. I want to find the cosine similarity of the two documents in lda space, so that:
x = cossim(lda[Doc1], lda[Doc2])
The problem I'm noticing is that when I run this over multiple iterations, the cosine similarity is not always identical, even when I use the same saved LDA model. The values are very close, but they differ slightly on each run. In my actual code I have hundreds of documents, so I convert the topic distributions to dense vectors and use NumPy to do the calculations as a matrix:
documentsList = np.array(documentsList)
calcMatrix = 1 - cdist(documentsList, documentsList, metric=self.metric)
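For context, the conversion step looks roughly like this; the `topics_to_dense` helper and the toy topic distributions below are made up for illustration, not my actual code:

```python
import numpy as np
from scipy.spatial.distance import cdist

def topics_to_dense(topic_dists, num_topics):
    """Convert gensim-style sparse (topic_id, prob) lists to a dense matrix."""
    dense = np.zeros((len(topic_dists), num_topics))
    for row, dist in enumerate(topic_dists):
        for topic_id, prob in dist:
            dense[row, topic_id] = prob
    return dense

# Hypothetical topic distributions for three documents
docs = [
    [(0, 0.7), (2, 0.3)],
    [(0, 0.7), (2, 0.3)],
    [(1, 1.0)],
]
dense = topics_to_dense(docs, num_topics=3)
# 1 - cosine distance = cosine similarity
sims = 1 - cdist(dense, dense, metric='cosine')
```

With this toy input, the two identical distributions get similarity 1.0 and the disjoint pair gets 0.0, so the dense conversion itself is deterministic; any run-to-run variation has to come from earlier in the pipeline.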
Am I running into a rounding error with NumPy (or another bug in my code), or is this behavior I should expect when using LDA to infer the topic distribution of a document?
Edit: I'm going to run a simple cosine similarity on 2 different documents using my lda model, and plot the spread of results. I will report back with what I find.
OK, here are the results of running cosine similarity against 2 documents, using the same LDA model.
Here is my code:
import gensim
from gensim import matutils
import matplotlib.pyplot as plt
import seaborn as sns

def testSpacesTwoDocs(doc1, doc2, dictionary):
    simList = []
    lda = gensim.models.ldamodel.LdaModel.load('LDA_Models/lda_bow_behavior_allFields_t385_p10')
    # doc2bow is deterministic, so the bow vectors are identical on every pass
    doc1bow = dictionary.doc2bow(doc1)
    doc2bow = dictionary.doc2bow(doc2)
    for i in range(50):
        vec1 = lda[doc1bow]
        vec2 = lda[doc2bow]
        S = matutils.cossim(vec1, vec2)
        simList.append(S)
    for entry in simList:
        print(entry)
    sns.set_style("darkgrid")
    plt.plot(simList, 'bs--')
    plt.show()
    return
Here are my results: 0.0082616863035, 0.00828413767524, 0.00826550453411, 0.00816756826185, 0.00829832701338, 0.00828970584276, 0.00828578705814, 0.00817109902484, 0.00817138247141, 0.00825297374028, 0.008269435921, 0.00826470121538, 0.00818282042634, 0.00824660449673, 0.00818087532906, 0.0081770261766, 0.00817128310123, 0.00817643202588, 0.00827404791376, 0.00832439428054, 0.00816643128216, 0.00828540881955, 0.00825746652101, 0.00816793513824, 0.00828471827526, 0.00827161219003, 0.00817773114553, 0.00826166001503, 0.00828048713541, 0.00817435544365, 0.0082956702812, 0.00826167470288, 0.00829873425476, 0.00825744872634, 0.00826802120149, 0.00829604894909, 0.0081776752236, 0.00817613482849, 0.00825839326441, 0.00817530362838, 0.0081747561999, 0.0082597447174, 0.00828958180101, 0.00827157760835, 0.00826939127657, 0.00826138381094, 0.00817755590806, 0.00827135780051, 0.00827314260067, 0.00817035250043
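If the variation really does come from stochastic inference, one workaround I'm considering is averaging several inference runs per document before comparing. A rough sketch of what I have in mind; the `mean_topic_vector` helper is hypothetical, and the stub model only stands in for a real gensim model so the snippet is self-contained:

```python
import numpy as np

def mean_topic_vector(lda, bow, num_topics, n_samples=10):
    """Average several (possibly stochastic) lda[bow] inference runs
    into one dense, more stable topic vector."""
    acc = np.zeros(num_topics)
    for _ in range(n_samples):
        for topic_id, prob in lda[bow]:  # gensim-style sparse (id, prob) pairs
            acc[topic_id] += prob
    return acc / n_samples

# Stand-in for a real model: always returns the same sparse distribution
class StubModel:
    def __getitem__(self, bow):
        return [(0, 0.6), (2, 0.4)]

vec = mean_topic_vector(StubModel(), [], num_topics=3)
```

The idea is that averaging should shrink the run-to-run jitter in each document's vector before the cosine similarities are computed.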
Am I correct to assume that the LDA model is inferring the topic distribution of both documents at each iteration, and thus that the cosine similarities are stochastic rather than deterministic? Is this much variation a sign that I'm not training my model long enough? Or am I not properly normalizing the vectors? Thanks