I aim to apply a k-means clustering algorithm to a very large data set using Spark (1.3.1) MLlib. I pulled the data from HDFS using a HiveContext in Spark, and would eventually like to put it back there the same way, in this format:
|I.D  |cluster |
==================
|546  |2       |
|6534 |4       |
|236  |5       |
|875  |2       |
I have run the following code, where "data" is a DataFrame with an ID in the first column and doubles in the remaining columns.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
val clusters = KMeans.train(parsedData, 3, 20)
This runs successfully, but I'm now stuck mapping the clusters back to their respective IDs in a DataFrame as described above. I can convert the predictions to a DataFrame with:
sc.makeRDD(clusters.predict(parsedData).toArray()).toDF()
But that's as far as I've got. This post is on the right track, and this post, I think, is asking a similar question to mine.
I suspect the LabeledPoint class is needed. Any comments or answers would be appreciated, cheers.
Edit: just found this on the Spark user list, looks promising
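Based on that, one approach I'm considering is to call predict per row and keep the ID alongside the cluster label, rather than predicting on the ID-less RDD. A rough, untested sketch (assuming the ID is a Long in column 0, sqlContext implicits are in scope, and the name withClusters is mine):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Needed in Spark 1.3 for rdd.toDF(...)
import sqlContext.implicits._

// Predict each row's cluster while keeping its ID from column 0,
// so the pairing never relies on RDD ordering.
val withClusters = data.rdd.map { row =>
  val features = Vectors.dense(row.getDouble(1), row.getDouble(2))
  (row.getLong(0), clusters.predict(features))
}.toDF("id", "cluster")

withClusters.show()
```

This would then be writable back to Hive via the same HiveContext.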