
I have a set of short documents (one or two paragraphs each). I have tried three different approaches to document similarity:

- simple cosine similarity on the TF-IDF matrix
- training LDA on the whole corpus, using the LDA model to create a vector for each document, and then applying cosine similarity
- training LSA on the whole corpus, using the LSA model to create a vector for each document, and then applying cosine similarity
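For concreteness, the first approach (cosine similarity on TF-IDF vectors) can be sketched roughly as below. This is a minimal standard-library illustration, not the code actually used in the experiments: the toy documents, the whitespace tokenizer, the smoothed IDF formula, and the function names `tfidf_vectors` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): two related documents and one unrelated one.
docs = [
    "the cat sat on the mat",
    "a cat lay on a mat",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # shares "cat", "on", "mat"
sim_unrelated = cosine(vecs[0], vecs[2])  # shares no terms
print(sim_related, sim_unrelated)
```

With this representation, two documents that share no terms always get a similarity of exactly zero, which is one property the low-dimensional LDA and LSA vectors do not have.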

Based on my experiments, I get better results from simple cosine similarity on the TF-IDF matrix, without any LDA or LSA. From what I have read, LDA or LSA should improve the results, but in my case they do not! Any idea why LDA and LSA give worse results? When trained for more than 1000 rounds, both LDA and LSA assign a similarity score above 90% to some documents that are totally unrelated!

Is there any justification for that?

Thanks


1 Answer


I have used the LDA4j implementation and got better results than with TF-IDF; similarly, for LSI I have used the semantic-vector implementation. If you have your own implementation, please share a sketch of the model. One more thing: you should normalize the corpus for better results.
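What "normalize the corpus" covers is not spelled out above. One common reading is text normalization before building any model: lowercasing, stripping punctuation and digits, and dropping stop words. A minimal sketch under that assumption follows; the `normalize` function and the tiny `STOP_WORDS` set are hypothetical and only for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "on", "in", "to", "is"}


def normalize(text):
    """Lowercase, replace non-letters with spaces, and drop stop words."""
    text = text.lower()
    # Anything that is not a-z or whitespace (punctuation, digits) becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


print(normalize("The Cat, on the mat (2 times)!"))  # ['cat', 'mat', 'times']
```

Applying the same normalization to every document before training TF-IDF, LSA, or LDA keeps the vocabulary smaller and makes the three approaches easier to compare.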