4
votes

I am using Python's K-means clustering algorithm to cluster documents. I have created a term-document matrix:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                                 stop_words='english')
    X = vectorizer.fit_transform(token_dict.values())

Then I applied K-means clustering using the following code:

    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    y = km.fit(X)

My next task is to see the top terms in every cluster. Searching on Google suggested that many people have used km.cluster_centers_.argsort()[:, ::-1] to find the top terms in the clusters, with the following code:

 print("Top terms per cluster:")
 order_centroids = km.cluster_centers_.argsort()[:, ::-1]
 terms = vectorizer.get_feature_names()
 for i in range(true_k):
     print("Cluster %d:" % i, end='')
     for ind in order_centroids[i, :10]:
         print(' %s' % terms[ind], end='')
         print()

Now, my question: to my understanding, km.cluster_centers_ returns the coordinates of the cluster centers, so for example if there are 100 features and three clusters it would return a matrix of 3 rows and 100 columns, each row representing a centroid. What I wish to understand is how this is used in the above code to determine the top terms in each cluster. Thanks, any comments are appreciated. Nadeem

2
I'm sure I could look it up, but what library do TfidfVectorizer and KMeans come from? You'll probably get a better response if you target this at experts in that package. Specifically, this information can go in the tags as well as the question itself and the main body of your text. – Andrew Jaffe
I have now mentioned the libraries I have used, which are from sklearn.feature_extraction.text import TfidfVectorizer and from sklearn.cluster import KMeans. – Nhqazi

2 Answers

4
votes

You're correct about the shape and meaning of the cluster centers. Because you're using a Tf-Idf vectorizer, your "features" are the words in a given document (and each document is its own vector). Thus, when you cluster the document vectors, each "feature" of the centroid represents the relevance of that word to it: "word" (in the vocabulary) = "feature" (in your vector space) = "column" (in your centroid matrix).
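
A quick sanity check on your own objects (a minimal sketch, assuming the km and vectorizer from your question):

    # one row per cluster, one column per word in the vocabulary
    print(km.cluster_centers_.shape)            # (true_k, n_terms)
    print(len(vectorizer.get_feature_names()))  # n_terms; newer scikit-learn calls this get_feature_names_out()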

The get_feature_names call gets the mapping of column index to the word it represents (so it seems from the documentation; if that doesn't work as expected, just invert the vocabulary_ mapping, which goes from word to column index, to get the same result).
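
Inverting vocabulary_ would look something like this (a sketch, not tested against your data):

    # vocabulary_ maps word -> column index; invert it to get column index -> word
    index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
    terms = [index_to_term[i] for i in range(len(index_to_term))]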

Then the .argsort()[:, ::-1] line converts each centroid into a list of column indices sorted in descending order of value, i.e. the columns most "relevant" (highly valued) in it, and hence the words most relevant (since words = columns).
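
A toy illustration of what that indexing does (plain NumPy, made-up numbers):

    import numpy as np

    # pretend centroid matrix: 2 clusters x 4 terms
    centers = np.array([[0.1, 0.9, 0.0, 0.4],
                        [0.7, 0.0, 0.2, 0.1]])
    print(centers.argsort()[:, ::-1])
    # [[1 3 0 2]
    #  [0 2 3 1]]  -> column indices ordered from largest weight to smallest, per row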

The rest of the code is just printing; I'm sure that doesn't need any explaining. All the code is really doing is sorting each centroid's features/words in descending order of how highly valued they are in it, then mapping those columns back to their original words and printing them.
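
If it helps, here's a compact equivalent of that printing loop (same idea, just joining each cluster's words into one string):

    for i, centroid in enumerate(km.cluster_centers_):
        top10 = centroid.argsort()[::-1][:10]  # column indices of the 10 largest values
        print("Cluster %d: %s" % (i, ' '.join(terms[ind] for ind in top10)))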

0
votes

A little late to the game, but I had the same question and couldn't find a satisfactory answer.

Here's what I did:

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# documents you are clustering
docs = ['first document', 'second', 'third doc', 'etc.']

# init vectorizer
tfidf = TfidfVectorizer()

# fit vectorizer
tfidf.fit(docs)

# create vecs for your sents
vecs = tfidf.transform(docs)

# fit your kmeans cluster to vecs
# don't worry about the hyperparameters
clusters = MiniBatchKMeans(
    n_clusters=16, 
    init_size=1024, 
    batch_size=2048, 
    random_state=20
).fit_predict(vecs)

# get list mapping keyword id (column index) to keyword name
labels = tfidf.get_feature_names()

def get_cluster_keywords(vecs, clusters, docs, labels, top_n=10):
    # init a dict where we will count term occurence
    cluster_keyword_ids = {cluster_id: {} for cluster_id in set(clusters)}
    
    # loop through the vector, cluster and content of each doc
    for vec, i, sent in zip(vecs, clusters, docs):
        
        # inspect non zero elements of rows of sparse matrix
        for j in vec.nonzero()[1]:
            
            # if we haven't seen this keyword in this cluster yet, initialise its count
            if j not in cluster_keyword_ids[i]:
                cluster_keyword_ids[i][j] = 0
            
            # add a count
            cluster_keyword_ids[i][j] += 1

    # cluster_keyword_ids contains ids
    # we need to map back to keywords
    # do this with the labels param
    return {
        cluster_id: [
            labels[keyword_id] # map from kw id to keyword
            
            # sort through our keyword_id_counts
            # only return the top n per cluster
            for keyword_id, count in sorted(
                keyword_id_counts.items(),
                key=lambda kv: kv[1], # sort from highest count to lowest
                reverse=True
            )[:top_n]
        ] for cluster_id, keyword_id_counts in cluster_keyword_ids.items()
    }
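
Then, for example (using the toy objects defined above; with real data you'd want more documents than the 16 clusters used here):

cluster_keywords = get_cluster_keywords(vecs, clusters, docs, labels)
for cluster_id, keywords in sorted(cluster_keywords.items()):
    print('Cluster %d: %s' % (cluster_id, ', '.join(keywords)))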