0
votes

I have an LDA topic model trained with MALLET, and I want to compute the cosine similarity between two documents. However, I'm not sure which of MALLET's output files I should compute the cosine on.

My cosine similarity function works fine; I'm just not sure what I should be comparing in MALLET's output.

Any help would be appreciated!

1 Answer

2
votes

Each document is represented by its topic composition, so those are the vectors you have to compare. Use the --output-doc-topics parameter to get the file you need.

The rows are the documents and the columns are the proportions of each topic in the document. In the current version (2.0.8) the columns are sorted in ascending order by topic ID; in other versions they are sorted from highest to lowest probability, so the entries have to be put back in topic-ID order before two documents can be compared position by position.
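As a rough illustration, here is a minimal Python sketch of reading such a file and comparing two rows. It assumes the 2.0.8-style layout with tab-separated fields (document index, document name, then one proportion per topic in topic-ID order) and skips any line starting with "#". The names parse_doc_topics and cosine_similarity, and the sample rows, are made up for this example and are not part of MALLET.

```python
import math

def parse_doc_topics(lines):
    """Map document name -> list of topic proportions (topic-ID order).

    Assumed layout per row: index <TAB> name <TAB> p0 <TAB> p1 ...
    """
    docs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a possible header line
        fields = line.split("\t")
        docs[fields[1]] = [float(x) for x in fields[2:]]
    return docs

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented sample rows standing in for the real doc-topics file.
sample = [
    "0\tdoc_a.txt\t0.7\t0.2\t0.1",
    "1\tdoc_b.txt\t0.1\t0.2\t0.7",
]
docs = parse_doc_topics(sample)
sim = cosine_similarity(docs["doc_a.txt"], docs["doc_b.txt"])
```

With a real file you would pass an open file handle instead of the sample list, for example `parse_doc_topics(open(path, encoding="utf-8"))`, where `path` points to whatever file name you gave to --output-doc-topics.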

You should also consider metrics other than cosine similarity, e.g. the (symmetric) Kullback-Leibler divergence or the Hellinger distance.
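For reference, both alternatives can be sketched in a few lines of Python. This is only an illustration under some assumptions: the inputs are two topic-proportion vectors of equal length in the same topic order, "symmetric" KL is taken to mean KL(p||q) + KL(q||p) (some people use the average instead), and a small eps (my own choice) guards against log(0). The function names are made up for this example.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

def symmetric_kl(p, q, eps=1e-12):
    """KL(p||q) + KL(q||p); eps is an assumed floor to avoid log(0)."""
    total = 0.0
    for a, b in zip(p, q):
        a = max(a, eps)
        b = max(b, eps)
        # (a - b) * log(a / b) is the sum of both KL terms for this topic
        total += (a - b) * math.log(a / b)
    return total

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
h = hellinger(p, q)
kl = symmetric_kl(p, q)
```

Note that both of these are distances (0 means identical), whereas cosine similarity is 1 for identical directions, so rankings have to be read the other way round.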