
I am trying to implement K-means clustering in Spark with Scala.

Currently I have an RDD that looks like this; it contains the assigned cluster index along with each data point.

scala> res2.collect
res54: Array[(Int, Array[Any])] = Array((2,Array(19, 15, 39)), (2,Array(21, 15, 81)), (2,Array(20, 16, 6)), (1,Array(23, 16, 77)), (2,Array(31, 17, 40)), (3,Array(22, 17, 76)), (1,Array(35, 18, 6)), (3,Array(23, 18, 94)), (1,Array(64, 19, 3)), (1,Array(30, 19, 72)))

My next step is to sum the arrays elementwise by key and divide each sum by the number of points for that key, i.e. average the points per cluster to get the new set of centroids. For example, for key 2 the four points above should average to (22.75, 15.75, 41.5).

I am not able to figure out how to achieve this, since simply using reduceByKey(_ + _) won't work for arrays.
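For reference, this is the kind of call I mean (using res2 from above), and it does not compile because + is not defined between two arrays:

// does not compile: reduceByKey needs a (V, V) => V function for V = Array,
// and `_ + _` is not such a function for arrays
res2.reduceByKey(_ + _)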


2 Answers


reduceByKey will work with arrays as long as you give it a function that adds two arrays elementwise. By count I assume you mean the number of points associated with a given key. countByKey() would give you that as a local Map, but since the counts need to be joined with the summed arrays, it is simpler to compute them as an RDD as well. Something like this should work:

// Number of points per key, kept as an RDD so it can be joined with the sums.
val keycount = your_rdd.mapValues(_ => 1L).reduceByKey(_ + _)

// Elementwise sum of two 3-element arrays.
def reduceSum(a1: Array[Int], a2: Array[Int]): Array[Int] =
  Array(a1(0) + a2(0), a1(1) + a2(1), a1(2) + a2(2))

// Elementwise sums per key.
val summed = your_rdd.reduceByKey(reduceSum)

// Divide each summed element by the count to get the new centroid per key.
val centroids = keycount.join(summed).mapValues {
  case (count, arr) => arr.map(x => x.toDouble / count)
}
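As a side note, the sum and the count can also be accumulated in a single pass with aggregateByKey, which avoids the separate count/join step. A rough sketch, again assuming your_rdd holds (Int, Array[Int]) pairs with 3-element arrays:

// One pass per key: accumulate (elementwise sum, point count), then divide.
val centroidsOnePass = your_rdd
  .aggregateByKey((Array(0, 0, 0), 0L))(
    // fold one data point into the running (sum, count)
    (acc, arr) => (acc._1.zip(arr).map { case (s, x) => s + x }, acc._2 + 1L),
    // merge two partial (sum, count) accumulators
    (a, b) => (a._1.zip(b._1).map { case (s, t) => s + t }, a._2 + b._2)
  )
  .mapValues { case (sum, count) => sum.map(_.toDouble / count) }

Either way the result is an RDD of (key, Array[Double]) holding the new centroid of each cluster.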

Here is my solution:

import spark.implicits._

// The sample data as an RDD of (key, Array) pairs.
val rdd = spark.sparkContext.parallelize(Array(
  (2, Array(19, 15, 39)), (2, Array(21, 15, 81)), (2, Array(20, 16, 6)),
  (1, Array(23, 16, 77)), (2, Array(31, 17, 40)), (3, Array(22, 17, 76)),
  (1, Array(35, 18, 6)), (3, Array(23, 18, 94)), (1, Array(64, 19, 3)),
  (1, Array(30, 19, 72))))

// Average the elements within each array (integer division),
// then sum those per-point averages by key.
rdd.map(line => (line._1, line._2.sum / line._2.size))
  .reduceByKey(_ + _)
  .toDF("key", "avg")
  .show()

Output:

+---+---+
|key|avg|
+---+---+
|  1|125|
|  3| 83|
|  2|106|
+---+---+