2 votes

I've implemented the Apache Spark example at

https://spark.apache.org/docs/1.1.0/mllib-clustering.html#examples

Here is the source:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

Using this dataset:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

I can extract the cluster centers using :

println(clusters.clusterCenters(0))
println(clusters.clusterCenters(1))

which returns

[9.1,9.1,9.1]
[0.10000000000000002,0.10000000000000002,0.10000000000000002]

But there are some things I'm not sure about, which do not seem to be supported by the API:

How can I extract which points have been assigned to each of the two clusters?

How can I attach a label to each data point so that, when viewing which points are in each cluster, I can also determine each point's label? Do I need to modify the Spark KMeans implementation to achieve this?

In the first question, are you referring to the vectors that make up your rows, or the actual points within the vectors? – Michael Plazzer

2 Answers

2 votes

If you are using Java:

JavaRDD<Integer> clusterIndices = clusters.predict(parsedData);

since predict() is overloaded to accept a JavaRDD<Vector> as well.

0 votes

The method you are looking for is predict(), but it does not belong to KMeans; it is part of KMeansModel (which is the return type of KMeans.train(...)).

The usage would be:

    clusters.predict(data_to_cluster)
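
For example, here is a minimal sketch of how predict() can answer both of the original questions, assuming parsedData is the RDD of Vectors from the question and clusters is the trained KMeansModel (the labels file below is hypothetical):

    // Predict the cluster index for every point, giving (clusterIndex, point) pairs.
    val pointsWithCluster = parsedData.map(p => (clusters.predict(p), p))

    // Group by cluster index to list which points ended up in each cluster.
    pointsWithCluster.groupByKey().collect().foreach { case (clusterIndex, points) =>
      println("Cluster " + clusterIndex + ": " + points.mkString(", "))
    }

    // If each point has an external label, one way to attach it is to zip a
    // labels RDD with the data. This assumes a hypothetical file with one label
    // per line, in the same order as kmeans_data.txt; note that zip requires
    // both RDDs to have the same number of partitions and the same number of
    // elements per partition.
    val labels = sc.textFile("data/mllib/kmeans_labels.txt")
    labels.zip(parsedData)
          .map { case (label, p) => (clusters.predict(p), label, p) }
          .collect()
          .foreach(println)

So no change to the Spark KMeans implementation should be needed: predicting on each point (or on the whole RDD via the overloaded predict) is enough to recover cluster membership, and labels can be carried alongside the vectors.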