1 vote

I found the code in this post to be very helpful. (I would add a comment to that post but I need 50 reputation points.)

I used the same code as in the post above but added a test document I have been using to debug my own clustering code. For some reason, a word that occurs in only one document shows up as a top term in both clusters.

The code is:

Update: I added "Unique sentence" to the documents below.

documents = ["I ran yesterday.", 
                "The sun was hot.", 
                "I ran yesterday in the hot and humid sun.", 
                "Yesterday the sun was hot.",
                "Yesterday I ran in the hot sun.",
                "Unique sentence." ]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# cluster the documents
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# print the top terms per cluster
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.2, where this was removed
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

The output I receive is:

UPDATE: I updated the output below to reflect the "unique sentence" above.

Cluster 0:
 sun
 hot
 yesterday
 ran
 humid
 unique
 sentence
Cluster 1:
 unique
 sentence
 yesterday
 sun
 ran
 humid
 hot

You'll note that "humid" appears as a top term in both clusters even though it occurs in only one of the documents above. I would expect a unique word like "humid" to be a top term in just one of the clusters.

Thanks!


3 Answers

4 votes

TF*IDF tells you how representative a word (here, a column) is of a specific document (here, a row). By representative I mean: the word occurs frequently in one document but not frequently in the other documents. The higher the TF*IDF value, the more that word represents the document.
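To make this concrete, here is a minimal sketch using the question's documents (get_feature_names_out() requires scikit-learn >= 1.0; on older versions use get_feature_names()) that prints each document's non-zero TF*IDF weights:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I ran yesterday.",
        "The sun was hot.",
        "I ran yesterday in the hot and humid sun.",
        "Yesterday the sun was hot.",
        "Yesterday I ran in the hot sun.",
        "Unique sentence."]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)  # one row per document, one column per term

# higher weight = more representative of that document
for i, doc in enumerate(docs):
    row = X[i].toarray().ravel()
    weights = {t: round(w, 2)
               for t, w in zip(vec.get_feature_names_out(), row) if w > 0}
    print(doc, '->', weights)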

Now let's look at the values you actually work with. From sklearn's KMeans you use the attribute cluster_centers_. This gives you the coordinates of each cluster center: an array of TF*IDF weights, one per word. It is important to note that these are just an abstract form of word frequencies and no longer relate back to any specific document. Next, numpy.argsort() gives you the indices that would sort the array, starting with the index of the lowest TF*IDF value, and [:, ::-1] reverses that order. After this, the indices of the most representative words for each cluster center come first.
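A tiny illustration of that index juggling (the weights are made up):

import numpy as np

weights = np.array([0.1, 0.7, 0.0, 0.4])
print(np.argsort(weights))        # [2 0 3 1] - indices from lowest to highest weight
print(np.argsort(weights)[::-1])  # [1 3 0 2] - highest-weighted index first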

Now, let's talk a bit more about k-means. k-means initializes its k cluster centers randomly. Each document is then assigned to a center, and the cluster centers are recomputed; this repeats until the optimization criterion, minimizing the sum of squared distances between documents and their closest center, is met. What this means for you is that, because of the random initialization, each cluster dimension most likely doesn't have the TF*IDF value 0. Furthermore, k-means stops as soon as the optimization criterion is met. Thus, the TF*IDF values of a center mean only that the documents assigned to that cluster are closer to this center than to the other cluster centers.

One additional point: order_centroids[i, :10] prints the 10 most representative words for each cluster, but since your vocabulary has only 7 words in total, all of them are printed for both clusters either way, just in a different order.
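To see this concretely, here is a self-contained sketch reusing the question's documents (the exact centroid values depend on the random initialization):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["I ran yesterday.",
        "The sun was hot.",
        "I ran yesterday in the hot and humid sun.",
        "Yesterday the sun was hot.",
        "Yesterday I ran in the hot sun.",
        "Unique sentence."]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1).fit(X)

terms = vec.get_feature_names_out()
print(len(terms))  # 7, so a [:10] slice always returns every term

# argsort ranks every coordinate, however small, so 'humid' gets a
# position in both clusters' "top 10" whatever its actual weight is
humid_idx = list(terms).index('humid')
print(model.cluster_centers_[:, humid_idx])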

I hope this helped. By the way, k-means is not guaranteed to find the global optimum and can get stuck in a local one, which is why it is usually run multiple times with different random starting points (the n_init parameter; your code sets n_init=1).
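Continuing the sketch above: a larger n_init keeps the best of several random restarts, which reduces the local-optimum risk.

# n_init=10 runs ten initializations and keeps the one with the
# lowest inertia_ (the sum of squared distances being minimized)
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10).fit(X)
print(model.inertia_)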

1 vote

Not necessarily. The code you are using builds a vector space from the bag of words of your corpus, excluding stop words (I am ignoring the tf-idf weighting here). Looking at your original five documents (before the "Unique sentence" update), the vector space has size 5, with a word array like this (ignoring order):

word_vec_space = [yesterday, ran, sun, hot, humid]

Each document is assigned a numeric vector indicating whether it contains each word in 'word_vec_space'.

"I ran yesterday." -> [1,1,0,0,0]
"The sun was hot." -> [0,0,1,1,0]
...
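A minimal sketch of these vectors, using CountVectorizer(binary=True) purely to illustrate the 0/1 view (the question's code uses TfidfVectorizer instead, and note that sklearn orders the vocabulary alphabetically, so the columns differ from word_vec_space above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I ran yesterday.",
        "The sun was hot.",
        "I ran yesterday in the hot and humid sun.",
        "Yesterday the sun was hot.",
        "Yesterday I ran in the hot sun."]

vec = CountVectorizer(stop_words='english', binary=True)
B = vec.fit_transform(docs).toarray()
print(vec.get_feature_names_out())  # ['hot' 'humid' 'ran' 'sun' 'yesterday']
print(B)  # one 0/1 row per document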

When performing k-means clustering, you pick k starting points in the vector space and let them move around to optimize the clusters. You ended up with both cluster centroids containing a non-zero value for 'humid'. This is because the one sentence that contains 'humid' also contains 'sun', 'hot', 'ran', and 'yesterday'.

1 vote

Why would clusters have distinct top terms?

Consider the case where the clustering worked (very often it doesn't - beware): would you consider these clusters to be bad or good?

  • banana fruit
  • apple fruit
  • apple computer
  • windows computer
  • window blinds

If I ever got such clusters, I would be happy (in fact, I would suspect an error, because these are much too good results; text clustering is always borderline to non-working).

With text clusters, it's a lot about word combinations, not just single words. "Apple fruit" and "apple computer" are not the same thing.
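One way to act on that idea: include bigrams as features, so "apple fruit" and "apple computer" become distinct dimensions. A minimal sketch (the document list just mirrors the bullets above and is only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["banana fruit", "apple fruit", "apple computer",
        "windows computer", "window blinds"]

# ngram_range=(1, 2) adds word pairs alongside the single words
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # includes 'apple fruit', 'apple computer', ...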