I would like to implement a clustering algorithm on top of Lucene. For that, I need the tf-idf term vector of each document, so that I can represent centroids the same way documents are represented, compute the similarity between documents and clusters, and update the centroids by recalculating their feature values. How can I do that on top of Lucene?
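To make the clustering step concrete, this is roughly what I have in mind, as a plain-Java sketch with no Lucene calls. The class name `CentroidSketch`, the choice of cosine similarity, and the mean-based centroid update are my own assumptions; vectors are sparse maps from term to weight.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
// A sparse vector is a map from term to weight; missing terms count as 0.
class CentroidSketch {

    // Cosine similarity between two sparse vectors (0.0 if either is all zeros).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // New centroid = component-wise mean of the member vectors.
    static Map<String, Double> centroid(List<Map<String, Double>> members) {
        Map<String, Double> c = new HashMap<>();
        for (Map<String, Double> m : members) {
            for (Map.Entry<String, Double> e : m.entrySet()) {
                c.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        int n = members.size();
        c.replaceAll((term, w) -> w / n);
        return c;
    }
}
```

Because a centroid is just another term-to-weight map, it can be compared with a document using the same `cosine` method.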
Can I even get tf-idf?
I know that the term frequency in each document is stored, but does that mean I would need to calculate idf 'manually' for each term? And how would I then build the vectors to use for clustering?
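For reference, this is the 'manual' calculation I mean, as a minimal plain-Java sketch. It assumes the textbook weighting tf × ln(N / df), which is not necessarily what Lucene uses for scoring. The class `TfIdfSketch` and its inputs (per-document term frequencies, per-term document frequencies, total document count) are hypothetical stand-ins for whatever Lucene actually exposes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class TfIdfSketch {

    // Textbook inverse document frequency: ln(numDocs / docFreq).
    // This is an assumed variant, not Lucene's own scoring formula.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Builds a sparse tf-idf vector (term -> weight) for one document.
    // termFreqs: raw term counts in this document.
    // docFreqs:  number of documents in the collection containing each term.
    static Map<String, Double> tfIdfVector(Map<String, Integer> termFreqs,
                                           Map<String, Integer> docFreqs,
                                           int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df == 0) {
                continue; // no document frequency known for this term; skip it
            }
            vector.put(e.getKey(), e.getValue() * idf(numDocs, df));
        }
        return vector;
    }
}
```

With this weighting, a term that occurs in every document gets weight 0, so only the more distinctive terms contribute to the similarity.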
Thanks