I am using the k-means clustering algorithm in Python to cluster documents. I have created a term-document matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                             stop_words='english')
X = vectorizer.fit_transform(token_dict.values())
Then I applied k-means clustering using the following code:
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
y = km.fit(X)
My next task is to see the top terms in every cluster. Searching on Google suggested that many people use km.cluster_centers_.argsort()[:, ::-1] to find the top terms in the clusters, using the following code:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
Now my question: to my understanding, km.cluster_centers_ returns the coordinates of the cluster centers. For example, if there are 100 features and three clusters, it would return a matrix of 3 rows and 100 columns, with each row representing the centroid of one cluster. What I wish to understand is how it is used in the above code to determine the top terms in each cluster.

Thanks. Any comments are appreciated.
Nadeem
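To make the shapes concrete, here is a toy sketch of what the argsort expression does. The vocabulary and the 3 x 5 centroid matrix are made up for illustration; they stand in for vectorizer.get_feature_names() and km.cluster_centers_. Each centroid coordinate is the mean TF-IDF weight of one term over the documents assigned to that cluster, so the largest coordinates in a row point at the terms that weigh most in that cluster.

```python
import numpy as np

# Hypothetical vocabulary and centroids (3 clusters x 5 terms), standing in
# for vectorizer.get_feature_names() and km.cluster_centers_.
terms = ["apple", "bank", "cat", "dog", "loan"]
centers = np.array([
    [0.70, 0.00, 0.10, 0.05, 0.02],  # cluster 0
    [0.01, 0.60, 0.00, 0.03, 0.55],  # cluster 1
    [0.00, 0.02, 0.50, 0.65, 0.01],  # cluster 2
])

# argsort() returns, for each row, the column indices that would sort that
# row in ascending order; [:, ::-1] reverses every row so the index of the
# largest weight comes first.
order_centroids = centers.argsort()[:, ::-1]

# Map the first two indices of each row back to terms.
top_terms = [[terms[ind] for ind in row[:2]] for row in order_centroids]
for i, names in enumerate(top_terms):
    print("Cluster %d: %s" % (i, " ".join(names)))
```

So order_centroids has the same shape as the centroid matrix, but holds term indices rather than weights, and slicing order_centroids[i, :10] picks the ten highest-weighted terms of cluster i.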
Where do TfidfVectorizer and KMeans come from? You'll probably get a better response if you target this at experts in that package. Specifically, this information can go in the tags as well as in the question title and the main body of your text. – Andrew Jaffe