
Using KMeans with a TF-IDF vectorizer, is it possible for terms to occur in more than one cluster?

Here is the example dataset:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

I use a TF-IDF vectorizer for feature extraction:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
# Vocabulary terms, one per feature column
# (on scikit-learn < 1.0 use get_feature_names() instead)
terms = vectorizer.get_feature_names_out()
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# Sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end='')
    print()

When I cluster the documents with KMeans from scikit-learn, the results are as follows:

Top terms per cluster:
Cluster 0:  user,  eps,  interface,  human,  response,  time,  computer,  management,  engineering,  testing,
Cluster 1:  trees,  intersection,  paths,  random,  generation,  unordered,  binary,  graph,  interface,  human,
Cluster 2:  minors,  graph,  survey,  widths,  ordering,  quasi,  iv,  trees,  engineering,  eps,

We can see that some terms occur in more than one cluster (e.g., graph in clusters 1 and 2, eps in clusters 0 and 2).

Are the clustering results wrong, or is this acceptable because the TF-IDF scores of those terms differ from document to document?

Comments:

OK, you use KMeans, but how do you vectorize your data? – MMF
Using TfidfVectorizer with stopwords from my language's stopword list. – Ardi Tan
Can you show the code that performs the clustering, please? – MMF
I edited the question above :) – Ardi Tan
@ArdiTan can you show how you are getting the top terms per cluster? – João Almeida

1 Answer


I think you are a bit confused about what you are trying to do. The code you use clusters the documents, not the terms; the terms are the dimensions of the space in which the documents are clustered.
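Because of that, every cluster centroid stores a weight for every vocabulary term, and the "top terms" are simply the highest-weighted dimensions of each centroid, so the same term can legitimately rank highly in several clusters. A minimal sketch of how to check this, reusing the vectorizer and km fitted above (and assuming a scikit-learn version that provides get_feature_names_out):

# Vocabulary terms, one per feature column
terms = vectorizer.get_feature_names_out()
# Position of the term "graph" in the vocabulary
graph_idx = list(terms).index('graph')

# Every centroid has a weight for "graph"; the term shows up among the
# "top terms" of any cluster where that weight ranks in the top 10.
for i, centroid in enumerate(km.cluster_centers_):
    print("Weight of 'graph' in cluster %d: %.3f" % (i, centroid[graph_idx]))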

If you want to find which cluster each document belongs to, you just need to use the predict or fit_predict method, like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
for n in range(len(documents)):
    # predict returns an array with one label per row, so take element 0
    print("Doc %d belongs to cluster %d. " % (n, km.predict(feature[n])[0]))

And you get:

Doc 0 belongs to cluster 2. 
Doc 1 belongs to cluster 1. 
Doc 2 belongs to cluster 2. 
Doc 3 belongs to cluster 2. 
Doc 4 belongs to cluster 1. 
Doc 5 belongs to cluster 0. 
Doc 6 belongs to cluster 0. 
Doc 7 belongs to cluster 0. 
Doc 8 belongs to cluster 1. 
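Since the model is already fitted, you can also read the same assignments directly from km.labels_, or obtain them in one step with fit_predict, instead of calling predict one row at a time. A short sketch of that alternative, assuming the km and feature objects from the snippet above:

# labels_ holds the cluster assignment of every document seen during fit
for n, label in enumerate(km.labels_):
    print("Doc %d belongs to cluster %d." % (n, label))

# Equivalently, fit and assign in a single call:
# labels = km.fit_predict(feature)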

Take a look at the scikit-learn User Guide.