
I've been using Mahout to run k-means clustering on text documents, taking input from both XML files and a Solr index.

The clustering appears to work, and similar documents are indeed being put in the same k-means cluster, which is great.

However, whenever I display the GraphML output from ClusterDump (--outputFormat GRAPH_ML), I get a plot showing all the clusters, but with each element placed on the circumference of its parent cluster, so every element sits at approximately the same distance from the centroid.

I was expecting the elements to be scattered throughout the cluster depending on their similarity to each other (as in the Mahout examples).

Has anyone seen anything similar with their Mahout k-means clusters? I have tried to get to the bottom of this myself, but any hints or suggestions would be a huge help.
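
One way to check whether the ring layout reflects the data itself rather than the renderer is to compute each member's distance to its centroid and look at how much those distances vary. The following is only a minimal sketch: it assumes the vectors have already been exported as sparse index-to-weight dictionaries, and the function names and sample data are hypothetical, not part of Mahout.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as {index: weight}."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def distance_spread(centroid, members):
    """Return (mean, population std dev) of the member-to-centroid distances."""
    dists = [euclidean(centroid, m) for m in members]
    return mean(dists), pstdev(dists)


# Hypothetical data: three documents that share no terms, and their mean.
centroid = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
members = [{0: 1.0}, {1: 1.0}, {2: 1.0}]

avg, spread = distance_spread(centroid, members)
print(f"mean distance = {avg:.4f}, std dev = {spread:.4f}")
```

If the standard deviation is small relative to the mean for the real clusters, the members really are at nearly the same distance from the centroid, and a plot that positions them by that distance would be expected to show a ring.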

With much thanks,

P Morris


1 Answer


Could you please explain how you managed to cluster Solr index input with Mahout and the k-means algorithm?

By the way, my output (clusters_dump) when I cluster a .txt file looks like this:

CL-0{n=0 c=[0:1.000, 1:1.000, 2:3.162, 3:1.000, 4:4.796, 6:1.000, 7:1.000, 8:1.000, 9:1.000, 10:1.000, 11:1.000, 12:4.690, 14:1.000, 15:11.446, 16:4.359] r=[]}

CL-1{n=0 c=[0:1.000, 1:1.000, 2:3.162, 3:1.000, 6:1.000, 7:1.000, 8:1.000, 9:1.000, 10:1.000, 11:1.000, 14:1.000, 15:11.446] r=[]}

CL-2{n=0 c=[4:1.000, 12:1.000, 13:8.315, 16:1.000] r=[]}

There are three clusters because I set the number of clusters to 3.
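
For comparing dumps like the one above, the lines can be turned into plain data structures. This is a rough sketch that assumes the exact line shape shown here (an identifier, then n=, c=[...] and r=[...] inside braces); the regular expression and function names are my own and are not a Mahout API, and other dump formats would need a different pattern.

```python
import re

# Assumed line shape, taken from the dump shown above:
#   CL-2{n=0 c=[4:1.000, 12:1.000] r=[]}
CLUSTER_RE = re.compile(
    r"^(?P<id>[A-Z]+-\d+)\{n=(?P<n>\d+) c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}$"
)


def parse_sparse(text):
    """Parse 'index:value, index:value' into {index: value}; empty text gives {}."""
    vector = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        index, value = item.split(":")
        vector[int(index)] = float(value)
    return vector


def parse_cluster_line(line):
    """Split one dump line into its id, n, c (centroid) and r fields."""
    match = CLUSTER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised cluster line: {line!r}")
    return {
        "id": match["id"],
        "n": int(match["n"]),
        "centroid": parse_sparse(match["c"]),
        "radius": parse_sparse(match["r"]),
    }


line = "CL-2{n=0 c=[4:1.000, 12:1.000, 13:8.315, 16:1.000] r=[]}"
cluster = parse_cluster_line(line)
print(cluster["id"], cluster["n"], cluster["centroid"])
```

Reading c as the centroid and r as a per-dimension radius is an interpretation of the field names. Once parsed, it is easy to see that all three clusters above report n=0 and an empty r.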