
On the scikit-learn site there is an example of k-means applied to text mining. The excerpt of interest is below:

if opts.n_components:
    # LSA was applied: map the cluster centres back into the original term space
    original_space_centroids = svd.inverse_transform(km.cluster_centers_)
    order_centroids = original_space_centroids.argsort()[:, ::-1]
else:
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    # the ten highest-weighted terms for this cluster
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
    print()

(link to example)

My first question is with regard to km.cluster_centers_. Each term is a dimension, so a cluster centre's value along a term's dimension is the cluster's "location" for that term. Are these values sorted because a higher value for a given term means that term better represents the cluster? If so, please explain why this is the case.

Secondly, the example provides an option to perform LSA on the term-document matrix prior to clustering. May I assume that this helps with the uniqueness of each cluster through the orthogonality of the dimensions? Or does it do something else? Is it typical to perform SVD prior to clustering?

Thanks in advance!


1 Answer


Let's start with question one! Each row of km.cluster_centers_ represents the centre of a cluster in your 'term frequency space'. Doing a (reversed, in this case) argsort on these rows gives the terms with the highest weight at each cluster centre, since a higher value corresponds to a higher term frequency. Doing the opposite would also be interesting: it would show the terms with the lowest frequency for your cluster.
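To make this concrete, here is a minimal sketch with made-up numbers (a hypothetical 5-term vocabulary and two hand-written cluster centres, not output from a real fit), showing how the reversed argsort pulls out each cluster's top terms:

```python
import numpy as np

# Hypothetical vocabulary of 5 terms and 2 hand-written cluster centres
# in term-frequency space (rows = clusters, columns = terms).
terms = np.array(["apple", "banana", "cherry", "date", "elderberry"])
cluster_centers = np.array([
    [0.1, 0.8, 0.05, 0.6, 0.0],   # cluster 0: dominated by "banana" and "date"
    [0.7, 0.0, 0.3,  0.1, 0.9],   # cluster 1: dominated by "elderberry" and "apple"
])

# argsort is ascending, so reverse each row to get highest-weight terms first
order_centroids = cluster_centers.argsort()[:, ::-1]

for i, order in enumerate(order_centroids):
    print("Cluster %d:" % i, " ".join(terms[order[:2]]))
```

The index array produced by `argsort()[:, ::-1]` is exactly what the scikit-learn excerpt iterates over to print the top ten terms per cluster.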

tl;dr: the terms are sorted to show the terms with the highest frequency in each cluster.

Now the second question. The components from the LSA are orthogonal, but this doesn't mean the projection of your data onto them is orthogonal. Here the LSA is used as a dimensionality-reduction technique: it essentially strips the least informative variation from your term matrix, so the clusters should be more meaningful because they won't be derived from potentially noisy features. Whether to reduce dimensionality before clustering really depends on your data; in general it can't hurt, but it will add time to the computation.
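The LSA-then-cluster flow can be sketched like this (a toy corpus I made up, with an assumed 2 components and 2 clusters; the real scikit-learn example reads these from command-line options):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# Tiny hypothetical corpus: two "pet" documents and two "finance" documents.
docs = [
    "the cat sat on the mat",
    "the cat and the dog are pets",
    "stock markets fell sharply today",
    "stock prices and markets rose today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse term-document matrix

# LSA: reduce to a few latent components, then re-normalise so that
# k-means (which uses Euclidean distance) behaves like cosine similarity.
svd = TruncatedSVD(n_components=2, random_state=0)
lsa = make_pipeline(svd, Normalizer(copy=False))
X_lsa = lsa.fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_lsa)
print(km.labels_)
```

On this toy corpus the two topics land in separate clusters; `svd.inverse_transform(km.cluster_centers_)` is then what maps the centres back to term space for inspection, as in the excerpt above.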

tl;dr: No, the LSA is used to reduce the dimensions and thereby improve the clustering.

Hope that helps.