1
votes

I'm trying to figure out how to interpret/analyse the clustering output I have. I have 50 folders, called clusters-0, clusters-1, clusters-2 and so on. This is because I passed '-k 50' in my command. I thought each of these folders contained one cluster, but now I'm not sure.

According to '--help', the kmeans '-cl' switch will: "If present, run clustering after the iterations have taken place."

So, does that mean that you need to use '-cl' for the clustering to actually happen?

If "-cl" is not used, are all those fifty folders just iterations of the k-means algorithm, with no output that actually contains the clusters?

Does each of those folders contain fifty clusters, with the final one being the best, most refined set of clusters?

1
Stack Overflow is for people who are knowledgeable and willing to help others. If you don't know the answer to the question you should not post. For example, the answer below from user2536804 is very helpful. If you have some insight into this subject, try to post an answer like that. – efx

1 Answer

2
votes

About the folder structure that Mahout k-means generates:

/clusters - contains the initial centroids of the clusters; the distance from each individual data point to these centroids is computed with the chosen distance measure.

/output/clusteredPoints - contains the sequence file that holds the cluster ID and the data used for clustering in (key, value) format. It is written by the clustering step that '-cl' enables.

/output/clusters-* - Each of these folders contains the cluster centroids recomputed in one iteration.

/output/clusters-*-final - contains the final cluster details. Here's what I have in it:

  VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
  VL-376{n=1607 c=[-0.068, 0.184, 0.787] r=[0.152, 0.020, 0.113]}
  VL-3492{n=375 c=[0.616, 0.111, 0.803] r=[0.289, 0.068, 0.227]}
  VL-347{n=507 c=[-0.496, 0.166, 0.574] r=[0.169, 0.078, 0.196]}
  VL-992{n=595 c=[0.154, 0.267, -0.394] r=[0.212, 0.083, 0.282]}
  VL-2468{n=189 c=[-0.696, -0.008, -0.494] r=[0.247, 0.213, 0.372]}

Here I have 6 clusters. Each line gives the cluster ID (e.g. 1123), the number of records in the cluster (n=615), the cluster centroid (c) and the radius (r).
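If you want to read those fields programmatically, something like the following might work. It is only a minimal sketch that assumes each line looks exactly like the dump above (dense vectors, no index:value pairs); `LINE_RE` and `parse_cluster_line` are names I made up for illustration, not part of Mahout.

```python
import re

# Assumed line format, copied from the dump above, e.g.
# VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}
LINE_RE = re.compile(
    r"(?P<prefix>[A-Z]+)-(?P<id>\d+)\{n=(?P<n>\d+) "
    r"c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}"
)


def to_floats(text):
    """Turn '0.655, 0.175, -1.042' into a list of floats."""
    return [float(x) for x in text.split(",") if x.strip()]


def parse_cluster_line(line):
    """Split one dump line into prefix, ID, size, centroid and radius."""
    m = LINE_RE.search(line.strip())
    if m is None:
        raise ValueError(f"unrecognised cluster line: {line!r}")
    return {
        "prefix": m.group("prefix"),       # "VL" in the dump above
        "id": int(m.group("id")),          # cluster ID
        "n": int(m.group("n")),            # number of records in the cluster
        "centroid": to_floats(m.group("c")),
        "radius": to_floats(m.group("r")),
    }


sample = "VL-1123{n=615 c=[0.655, 0.175, -1.042] r=[0.254, 0.086, 0.271]}"
cluster = parse_cluster_line(sample)
print(cluster["id"], cluster["n"], cluster["centroid"])
```

A line in any other layout raises a ValueError rather than being silently misread, which seemed the safer choice for a format I am only inferring from one dump.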

Also, the VL prefix means the cluster has converged, which is a good thing. Hope it helps!
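To pick out the final folder among the clusters-* folders described above, here is a rough sketch. It assumes the names follow the clusters-N and clusters-N-final pattern; `pick_final_clusters_dir` and the sample listing are hypothetical, just for illustration.

```python
import re


def pick_final_clusters_dir(names):
    """Return the clusters-N-final folder if present, else the highest clusters-N."""
    best_key = None
    best_name = None
    for name in names:
        m = re.fullmatch(r"clusters-(\d+)(-final)?", name)
        if m is None:
            continue  # skip anything that is not a clusters-* folder
        # A "-final" folder always wins; otherwise the larger iteration number wins.
        key = (m.group(2) is not None, int(m.group(1)))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name


# Hypothetical listing of an output directory
names = ["clusters-0", "clusters-1", "clusters-2", "clusters-3-final"]
print(pick_final_clusters_dir(names))
```

Comparing the iteration numbers as integers matters here, since a plain string sort would put clusters-10 before clusters-9.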