I found the code in this post to be very helpful. (I would add a comment to that post but I need 50 reputation points.)
I used the same code in the post above but added a test document I have been using for debugging my own clustering code. For some reason a word in 1 document appears in both clusters.
The code is:
Update: I added "Unique sentence" to the documents below.
documents = ["I ran yesterday.",
"The sun was hot.",
"I ran yesterday in the hot and humid sun.",
"Yesterday the sun was hot.",
"Yesterday I ran in the hot sun.",
"Unique sentence." ]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
#cluster documents
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
#print top terms per cluster clusters
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print ("Cluster %d:" % i,)
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind])
print
The output I receive is:
UPDATE: I updated the output below to reflect the "unique sentence" above.
Cluster 0: sun hot yesterday ran humid unique sentence Cluster 1: unique sentence yesterday sun ran humid hot
You'll note that "humid" appears as a top term in both clusters even though it's in just 1 line of the documents above. I would expect a unique word, like "humid" in this case, to be a top term in just 1 of the clusters.
Thanks!