1 vote

I know that K-Means is a lazy learner and will have to be retrained from scratch when new points arrive, but I'd still like to know whether there is any workaround for using a trained model to predict on new, unseen data.

I'm using the K-Means algorithm to cluster a medical corpus. I represent the corpus as a term-document matrix, and before feeding the data to k-means I apply truncated singular value decomposition (SVD) for dimensionality reduction. I've been wondering whether there's a way to cluster a new, unseen document without retraining the entire model.
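For concreteness, here's a minimal sketch of the pipeline I'm describing (scikit-learn; the toy corpus, component count, and cluster count are placeholders for my real setup, which uses 100 SVD components):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    corpus = [
        "patient presented with acute chest pain",
        "mri of the brain showed no abnormality",
        "follow up ct scan of the thorax recommended",
        "patient reports chronic lower back pain",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)       # term-document matrix (documents x terms)

    svd = TruncatedSVD(n_components=2)         # 100 in my real setup; kept tiny here
    X_reduced = svd.fit_transform(X)

    kmeans = KMeans(n_clusters=2, random_state=0)   # cluster count is a placeholder
    kmeans.fit(X_reduced)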

To get the vector representation of the new document and predict its cluster with the trained model, I need to make sure it uses the same vocabulary as the training corpus and keeps the same column order as the original term-document matrix. That seems feasible, since the new documents have a similar vocabulary. But how do I get the SVD representation of this document? This is where my understanding gets shaky, so correct me if I'm wrong: to perform SVD on this vector, I would have to append it to the original term-document matrix and recompute the decomposition. If I do that and again reduce to a limited number of features (100 in this case), I'm not sure how things will change. Will the new components correspond semantically to the original ones? If the corresponding features capture different concepts, it won't make sense to measure the distance of the new document from the existing cluster centroids (in the 100-feature space).
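To make the two options concrete, this is roughly what I'm weighing, reusing the fitted objects from the sketch above (`new_doc` is just an illustrative example):

    new_doc = ["ct scan of the chest shows no acute abnormality"]

    # Reusing the *fitted* vectorizer keeps the vocabulary and column order
    # identical to the training term-document matrix.
    new_vec = vectorizer.transform(new_doc)

    # Option A: project the new document with the already-fitted SVD
    # (no refit, so the reduced features keep their original meaning)
    # and assign it to the nearest existing centroid.
    new_reduced = svd.transform(new_vec)
    print(kmeans.predict(new_reduced))

    # Option B (what I describe above): append the document and refit the SVD.
    # My worry is that the refitted components may no longer line up with the
    # space the centroids live in.
    # import scipy.sparse
    # X_appended = scipy.sparse.vstack([X, new_vec])
    # svd_refit = TruncatedSVD(n_components=2).fit(X_appended)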

Is there a way to use a trained k-means model on new text data? Or is there another clustering approach better suited to this task?


1 Answer

2 votes

Your problem isn't k-means; a simple nearest-neighbor classifier using the means as its data will work.
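For example, a minimal sketch with random data standing in for your 100-dimensional SVD features:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 100))    # stand-in for the 100-d SVD features
    kmeans = KMeans(n_clusters=10, random_state=0).fit(X_train)

    x_new = rng.normal(size=(1, 100))        # stand-in for one new, reduced document

    # Assigning a new point with a fitted k-means model is just a
    # nearest-centroid (1-nearest-neighbor against the means) lookup:
    label = kmeans.predict(x_new)[0]

    # The same assignment done explicitly:
    dists = np.linalg.norm(kmeans.cluster_centers_ - x_new, axis=1)
    print(label, dists.argmin())             # same cluster index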

Your problem is the SVD, which is not stable: recomputing it after adding new data can give you entirely different components.