1
votes

To keep things simple, I will explain with a small example. I have 2 sets:
I have 10 unique string ids: id1, id2, id3, id4, id5, ... id10
I have 3 unique c-ids: cid1, cid2, cid3
There is a mapping between the 2 sets, but not between values within the same set.
The mapping is, say:
id1 : cid1,cid2
id2 : cid3
id3 : cid1 ... and so on.

I need to cluster the set of ids (strings) against the cids (strings) and vice versa.

Right now I have created a CSV file like the one below (similar to a sparse format).

id1 , cid1
id1 , cid2
id3 , cid3
...

I ran k-means in Weka, but I am not sure this is the right way. All those ids are actually features/attributes that do not have any specific order. But the way I am representing them, the columns are treated as attribute values. How can I convert them into features?

2
Does it have to be in Weka? (Would you be willing to try some other tool?) - Ashesh
I am willing to try any other tool. Please let me know. Thanks - user2793286
I have added my answer, let me know if it helps. - Ashesh

2 Answers

0
votes

For k-means you have to create equal-length vectors. One possible way: given that there are three unique c-ids (cid1, cid2 and cid3), you create a vector of length 3 for each id, with each element taking a binary value (0 or 1) denoting the absence or presence of that c-id.

id => [cid1, cid2, cid3]

i.e. the examples above can be written as:

id1,1,1,0
id2,0,0,1
id3,1,0,1
... 

Then I think you can cluster using k-means. I do not know the semantics of the ids here, so I can't really comment on how well it will cluster.
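As a rough illustration of the encoding described above, here is a minimal Python sketch that turns (id, cid) pairs into one 0/1 vector per id. The function name `build_presence_vectors` and the sample pairs are assumptions made for this example (the pairs are chosen to reproduce the three rows shown above); it is not tied to Weka or any other tool.

```python
from collections import defaultdict

def build_presence_vectors(pairs):
    """Turn (id, cid) pairs into one 0/1 vector per id.

    Column order is the sorted list of distinct cids, so every
    vector has the same length.
    """
    columns = sorted({cid for _, cid in pairs})
    col_index = {cid: i for i, cid in enumerate(columns)}
    rows = defaultdict(lambda: [0] * len(columns))
    for id_, cid in pairs:
        rows[id_][col_index[cid]] = 1  # mark this cid as present for this id
    return columns, dict(rows)

# Hypothetical sample pairs, in the same shape as the two-column CSV
pairs = [
    ("id1", "cid1"), ("id1", "cid2"),
    ("id2", "cid3"),
    ("id3", "cid1"), ("id3", "cid3"),
]

columns, vectors = build_presence_vectors(pairs)
for id_ in sorted(vectors):
    print(id_, vectors[id_])
```

To cluster the cids against the ids instead, the same function can be called with each pair swapped, e.g. `build_presence_vectors([(cid, id_) for id_, cid in pairs])`.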

0
votes

Since you are willing to try any other tool that will do the clustering, I recommend taking a look at SPMF.

SPMF is an open-source data mining library written in Java, specialized in pattern mining.

It is distributed under the GPL v3 license.

It offers implementations of 89 data mining algorithms for sequential pattern mining, association rule mining, itemset mining, sequential rule mining, and clustering.

The source code of each algorithm can be integrated into other Java software.

Moreover, SPMF can be used as a standalone program with a simple user interface or from the command line.

You can download the GUI program or the source code from here

Documentation and data-set description can be found on this page.


For k-means, the program accepts only integer values (there is a workaround for strings) separated by single spaces, and it assumes that all rows have the same length.

1 2 3 4
1 6 8 8
1 2 3 3
2 4 5 5
4 7 8 7
7 6 8 9
4 4 3 3
2 2 5 5
7 5 5 5
5 6 8 9
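If the ids have already been encoded as equal-length 0/1 vectors, a file in this shape could be produced with a few lines of Python. This is only a sketch: `to_space_separated` and the sample `vectors` dictionary are hypothetical names and data for illustration, and the only format assumption is the one stated above (integers, single spaces, same row length).

```python
def to_space_separated(vectors, ids):
    """Render one line per id: integer values separated by single spaces."""
    return "\n".join(" ".join(str(v) for v in vectors[i]) for i in ids)

# Hypothetical 0/1 presence vectors, one per id
vectors = {"id1": [1, 1, 0], "id2": [0, 0, 1], "id3": [1, 0, 1]}

# Keep the row order: the id labels are not written to the file,
# so this list is what maps a row back to its id afterwards.
ids = sorted(vectors)

text = to_space_separated(vectors, ids)
print(text)
```

The text can then be written to a file with `open(path, "w").write(text)`, where `path` is whatever input file name you pass to the tool.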

The output file format is defined as follows. Each line is a cluster and lists the vectors contained in the cluster. A vector is a list of double values separated by "," and enclosed between the "[" and "]" characters.

cluster 1: [1.0,2.0,3.0,4.0][1.0,2.0,3.0,3.0][2.0,4.0,5.0,5.0][4.0,4.0,3.0,3.0][2.0,2.0,5.0,5.0]
cluster 2: [7.0,6.0,8.0,9.0][1.0,6.0,8.0,8.0][4.0,7.0,8.0,7.0][5.0,6.0,8.0,9.0]
cluster 3: [7.0,5.0,5.0,5.0]
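To use these results further, each line can be read back into lists of numbers. The following is a small sketch that assumes the line layout is exactly as shown above; `parse_cluster_line` is a hypothetical helper written for this example, not part of SPMF.

```python
import re

def parse_cluster_line(line):
    """Split 'cluster N: [a,b][c,d]' into a label and a list of float vectors."""
    label, _, body = line.partition(":")
    vectors = [
        [float(x) for x in group.split(",")]
        for group in re.findall(r"\[([^\]]*)\]", body)  # text inside each [...]
    ]
    return label.strip(), vectors

label, members = parse_cluster_line("cluster 3: [7.0,5.0,5.0,5.0]")
print(label, members)
```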

However, if your data-set has only a few distinct strings, a "find and replace" will do the job.

In any other case you can use R.