Using KMeans with a TF-IDF vectorizer, is it possible to get terms occurring in more than one cluster?
Here is the dataset of examples:
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
I use a TF-IDF vectorizer for feature extraction:
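from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the TF-IDF matrix (one row per document, one column per term).
vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
# Vocabulary terms in column order (get_feature_names() on scikit-learn < 1.0).
terms = vectorizer.get_feature_names_out()

true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)

# For each centroid, term indices sorted by descending weight.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end='')
    print()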
When I cluster the documents using KMeans from scikit-learn, I get the results below:
Top terms per cluster:
Cluster 0: user, eps, interface, human, response, time, computer, management, engineering, testing,
Cluster 1: trees, intersection, paths, random, generation, unordered, binary, graph, interface, human,
Cluster 2: minors, graph, survey, widths, ordering, quasi, iv, trees, engineering, eps,
We can see that some terms occur in more than one cluster (e.g., graph in clusters 1 and 2, eps in clusters 0 and 2).
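To list every overlap rather than spotting them by eye, here is a minimal sketch that takes the three top-10 lists from the output above (copied by hand, so they reflect that one run only) and reports each term that appears in more than one list. The names top_terms and clusters_by_term are just illustrative.

```python
from collections import defaultdict

# Top-10 lists copied by hand from the printed output of one KMeans run.
top_terms = {
    0: ["user", "eps", "interface", "human", "response",
        "time", "computer", "management", "engineering", "testing"],
    1: ["trees", "intersection", "paths", "random", "generation",
        "unordered", "binary", "graph", "interface", "human"],
    2: ["minors", "graph", "survey", "widths", "ordering",
        "quasi", "iv", "trees", "engineering", "eps"],
}

# Map each term to the clusters whose top-10 list contains it.
clusters_by_term = defaultdict(list)
for cluster_id, term_list in top_terms.items():
    for term in term_list:
        clusters_by_term[term].append(cluster_id)

# Keep only the terms that show up in more than one cluster.
shared = {t: c for t, c in clusters_by_term.items() if len(c) > 1}
for term, clusters in sorted(shared.items()):
    print(term, clusters)
```

Besides graph and eps, this also flags engineering, human, interface and trees.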
Are the clustering results wrong, or is this acceptable because the TF-IDF scores of these terms differ from document to document?
But how do you vectorize your data? – MMF