K-means clustering on text data?

Question

To keep things simple, I will explain with a small example. I have two sets:
10 unique string ids: id1, id2, id3, id4, id5, ... id10
3 unique c-ids: cid1, cid2, cid3
There is a mapping between the two sets, but none among the values within the same set.
The mapping is, say:
id1 : cid1,cid2
id2 : cid3
id3 : cid1 ... and so on

I need to cluster the set of ids (strings) against the cids (strings), and vice versa.

Right now I have created a CSV file like the one below (similar to a sparse representation):

id1 , cid1
id1 , cid2
id3 , cid3
...

I ran k-means in Weka, but I am not sure this is the right way. All those ids are actually features/attributes that have no specific order, but with the way I am representing them, the columns are treated as attribute values. How can I convert them into features?

Does it have to be in Weka? Would you be willing to try some other tool? - Ashesh
I am willing to try any other tool. Please let me know. Thanks - user2793286

Manoj Awasthi · Accepted Answer · 2015-04-16T03:35:21

For k-means you have to create equal-length vectors. One possible way: given that there are three unique c-ids (cid1, cid2 and cid3), create a vector of length 3 for each id, where each element takes a binary value (0 or 1) denoting the absence or presence of that c-id.

id => [cid1, cid2, cid3]

That is, the examples above can be written as:

id1,1,1,0
id2,0,0,1
id3,1,0,1
...
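
The encoding above could be produced along these lines. This is only a sketch in Python, assuming the pairs have already been read from the CSV into a list of (id, cid) tuples. The names `pairs` and `one_hot_rows` are made up for illustration, and the sample data contains only the pairs implied by the rows shown above.

```python
def one_hot_rows(pairs):
    """Turn (row_key, feature) pairs into equal-length 0/1 vectors."""
    # Fix a column order so every vector has the same layout.
    features = sorted({feature for _, feature in pairs})
    column = {feature: i for i, feature in enumerate(features)}
    rows = {}
    for key, feature in pairs:
        # Start each new key with an all-zero vector, then mark presence.
        vector = rows.setdefault(key, [0] * len(features))
        vector[column[feature]] = 1
    return features, rows


# Illustrative pairs matching the three example rows above.
pairs = [("id1", "cid1"), ("id1", "cid2"), ("id2", "cid3"),
         ("id3", "cid1"), ("id3", "cid3")]

features, rows = one_hot_rows(pairs)
for key, vector in rows.items():
    print(key, vector)

# For the reverse direction (clustering cids by the ids they map to),
# swap each pair so that the cid becomes the row key.
id_features, cid_rows = one_hot_rows([(cid, id_) for id_, cid in pairs])
```

Swapping the pair order gives one row per cid with one column per id, which is the "vice versa" case from the question.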

Then I think you can cluster using k-means. I do not know the semantics of the ids here, so I can't really comment on how well they will cluster.
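
To make the clustering step concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) over such binary vectors, written in pure Python so it runs without Weka. The function name `kmeans`, its parameters, and the four sample vectors are assumptions for illustration, not data from the question. Whether Euclidean distance on 0/1 vectors is meaningful still depends on what the ids represent.

```python
import random


def kmeans(vectors, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2
                                  for x, m in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Hypothetical binary rows in the id => [cid1, cid2, cid3] layout.
vectors = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
labels = kmeans(vectors, k=2)
print(labels)
```

In practice a library implementation would normally replace this function, for example `KMeans` from scikit-learn. The sketch is only meant to show that the binary vectors are the sole input k-means needs.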
