1 vote

I found the code in this post to be very helpful. (I would add a comment to that post but I need 50 reputation points.)

I used the same code as in the post above but added a test document I have been using to debug my own clustering code. For some reason, a word that occurs in only one document shows up as a top term in both clusters.

The code is:

Update: I added "Unique sentence" to the documents below.

documents = ["I ran yesterday.", 
                "The sun was hot.", 
                "I ran yesterday in the hot and humid sun.", 
                "Yesterday the sun was hot.",
                "Yesterday I ran in the hot sun.",
                "Unique sentence." ]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# cluster the documents
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# print the top terms per cluster
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.2, where this was removed
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

The output I receive is:

UPDATE: I updated the output below to reflect the "unique sentence" above.

Cluster 0:
 sun
 hot
 yesterday
 ran
 humid
 unique
 sentence
Cluster 1:
 unique
 sentence
 yesterday
 sun
 ran
 humid
 hot

You'll note that "humid" appears as a top term in both clusters even though it occurs in only one of the documents above. I would expect a unique word like "humid" to be a top term in just one of the clusters.

Thanks!


3 Answers

4 votes

TF*IDF tells you how representative a word (here, a column) is of a specific document (here, a row). By representative I mean: the word occurs frequently in one document but not frequently in the other documents. The higher the TF*IDF value, the more that word represents the document.
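To make this concrete, here is a minimal sketch using the question's documents (get_feature_names_out() requires scikit-learn >= 1.0; on older versions use get_feature_names()) that prints each document's non-zero TF*IDF weights:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I ran yesterday.",
        "The sun was hot.",
        "I ran yesterday in the hot and humid sun.",
        "Yesterday the sun was hot.",
        "Yesterday I ran in the hot sun.",
        "Unique sentence."]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)  # one row per document, one column per term

# higher weight = more representative of that document
for i, doc in enumerate(docs):
    row = X[i].toarray().ravel()
    weights = {t: round(w, 2)
               for t, w in zip(vec.get_feature_names_out(), row) if w > 0}
    print(doc, '->', weights)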

Now let's look at the values you actually work with. From sklearn's KMeans you use the attribute cluster_centers_. This gives you the coordinates of each cluster center: an array of TF*IDF weights, one per word. It is important to note that these are just an abstract form of word frequencies and no longer relate back to any specific document. Next, numpy.argsort() gives you the indices that would sort the array, starting with the index of the lowest TF*IDF value, and [:, ::-1] reverses that order. After this, the indices of the most representative words for each cluster center come first.
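A tiny illustration of that index juggling (the weights are made up):

import numpy as np

weights = np.array([0.1, 0.7, 0.0, 0.4])
print(np.argsort(weights))        # [2 0 3 1] - indices from lowest to highest weight
print(np.argsort(weights)[::-1])  # [1 3 0 2] - highest-weighted index first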

Now, let's talk a bit more about k-means. k-means initializes its k cluster centers randomly. Each document is then assigned to a center, and the cluster centers are recomputed; this repeats until the optimization criterion, minimizing the sum of squared distances between documents and their closest center, is met. What this means for you is that, because of the random initialization, each cluster dimension most likely doesn't have the TF*IDF value 0. Furthermore, k-means stops as soon as the optimization criterion is met. Thus, the TF*IDF values of a center mean only that the documents assigned to that cluster are closer to this center than to the other cluster centers.

One additional point: order_centroids[i, :10] prints the 10 most representative words for each cluster, but since your vocabulary has only 7 words in total, all of them are printed for both clusters either way, just in a different order.
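To see this concretely, here is a self-contained sketch reusing the question's documents (the exact centroid values depend on the random initialization):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["I ran yesterday.",
        "The sun was hot.",
        "I ran yesterday in the hot and humid sun.",
        "Yesterday the sun was hot.",
        "Yesterday I ran in the hot sun.",
        "Unique sentence."]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1).fit(X)

terms = vec.get_feature_names_out()
print(len(terms))  # 7, so a [:10] slice always returns every term

# argsort ranks every coordinate, however small, so 'humid' gets a
# position in both clusters' "top 10" whatever its actual weight is
humid_idx = list(terms).index('humid')
print(model.cluster_centers_[:, humid_idx])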

I hope this helped. By the way, k-means is not guaranteed to find the global optimum and can get stuck in a local one, which is why it is usually run multiple times with different random starting points (the n_init parameter; your code sets n_init=1).
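Continuing the sketch above: a larger n_init keeps the best of several random restarts, which reduces the local-optimum risk.

# n_init=10 runs ten initializations and keeps the one with the
# lowest inertia_ (the sum of squared distances being minimized)
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10).fit(X)
print(model.inertia_)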

1 vote

Not necessarily. The code you are using builds a vector space from the bag of words of your corpus, excluding stop words (I am ignoring the tf-idf weighting here). Looking at your original five documents (before the "Unique sentence" update), the vector space has size 5, with a word array like this (ignoring order):

word_vec_space = [yesterday, ran, sun, hot, humid]

Each document is assigned a numeric vector indicating whether it contains each word in 'word_vec_space'.

"I ran yesterday." -> [1,1,0,0,0]
"The sun was hot." -> [0,0,1,1,0]
...
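A minimal sketch of these vectors, using CountVectorizer(binary=True) purely to illustrate the 0/1 view (the question's code uses TfidfVectorizer instead, and note that sklearn orders the vocabulary alphabetically, so the columns differ from word_vec_space above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I ran yesterday.",
        "The sun was hot.",
        "I ran yesterday in the hot and humid sun.",
        "Yesterday the sun was hot.",
        "Yesterday I ran in the hot sun."]

vec = CountVectorizer(stop_words='english', binary=True)
B = vec.fit_transform(docs).toarray()
print(vec.get_feature_names_out())  # ['hot' 'humid' 'ran' 'sun' 'yesterday']
print(B)  # one 0/1 row per document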

When performing k-means clustering, you pick k starting points in the vector space and let them move around to optimize the clusters. You ended up with both cluster centroids containing a non-zero value for 'humid'. This is because the one sentence that contains 'humid' also contains 'sun', 'hot', 'ran', and 'yesterday'.

1 vote

Why would clusters have distinct top terms?

Consider the case where the clustering worked (very often it doesn't - beware): would you consider these clusters to be bad or good?

  • banana fruit
  • apple fruit
  • apple computer
  • windows computer
  • window blinds

If I ever got such clusters, I would be happy (in fact, I would suspect an error, because these are much too good results; text clustering is always borderline to non-working).

With text clusters, it's a lot about word combinations, not just single words. "Apple fruit" and "apple computer" are not the same thing.
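One way to act on that idea: include bigrams as features, so "apple fruit" and "apple computer" become distinct dimensions. A minimal sketch (the document list just mirrors the bullets above and is only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["banana fruit", "apple fruit", "apple computer",
        "windows computer", "window blinds"]

# ngram_range=(1, 2) adds word pairs alongside the single words
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # includes 'apple fruit', 'apple computer', ...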