On the scikit-learn site there is an example of k-means applied to text mining. The excerpt of interest is below:
if opts.n_components:
    original_space_centroids = svd.inverse_transform(km.cluster_centers_)
    order_centroids = original_space_centroids.argsort()[:, ::-1]
else:
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
(link to example)
My first question concerns km.cluster_centers_. Each term is a dimension of the feature space, so a cluster center's value for a term is that cluster's "location" along the term's dimension. Are these values sorted because a higher value for a given term indicates how strongly that term represents the cluster? If so, please explain why that is the case.
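To make my mental model concrete, here is a minimal sketch of how I read the centroid sorting. It is not the example's code: the toy corpus and variable names are my own, and I use get_feature_names_out, which I believe is the newer name for get_feature_names.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "cats purr and sleep", "dogs bark and run",
    "cats chase mice", "dogs fetch balls",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)            # shape: (n_docs, n_terms)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)                # (n_clusters, n_terms)

# Sorting each centroid row in descending order gives, per cluster, the term
# indices with the largest centroid coordinates, i.e. the terms on which that
# cluster's documents have the highest average tf-idf weight.
terms = vectorizer.get_feature_names_out()      # assumed newer API name
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(2):
    print("Cluster %d:" % i, " ".join(terms[ind] for ind in order_centroids[i, :3]))

Is "largest average tf-idf weight in the cluster" the right way to interpret those top coordinates?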
Secondly, the example script provides an option to perform LSA on the term-document matrix prior to clustering. May I assume that this helps keep the clusters distinct by projecting onto orthogonal dimensions? Or does it accomplish something else? Is it typical to perform SVD prior to clustering?
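For reference, this is roughly how I understand the LSA option would be wired up: TruncatedSVD followed by re-normalization before KMeans, with centroids mapped back to term space afterwards as in the excerpt above. The tiny n_components is only so this runs on the toy corpus, not a suggested setting.

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

corpus = [
    "cats purr and sleep", "dogs bark and run",
    "cats chase mice", "dogs fetch balls",
]
X = TfidfVectorizer().fit_transform(corpus)

# Reduce the tf-idf matrix with truncated SVD (LSA), re-normalize, then
# cluster in the reduced space.
svd = TruncatedSVD(n_components=2, random_state=0)   # small value for the toy corpus
lsa = make_pipeline(svd, Normalizer(copy=False))
X_lsa = lsa.fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_lsa)

# Centroids live in LSA space, so they are mapped back to term space
# before the term sorting shown in the excerpt.
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
print(original_space_centroids.shape)                # (n_clusters, n_terms)

If this is what the option does, is the benefit mainly noise reduction and decorrelation of terms, or something more specific to k-means?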
Thanks in advance!