ReduceByKey for a HashMap based RDD

Question

I have an RDD A of tuples (key, HashMap[Int, Set[String]]) which I want to convert to a new RDD B of tuples (key, HashMap[Int, Set[String]]), where B has unique keys and the value for each key k is the merge of all maps for key k in RDD A, with the sets stored under the same inner key unioned.

For example,

RDD A

(1,{1->Set(3,5)}), (2,{3->Set(5,6)}), (1,{1->Set(3,4), 7->Set(10, 11)})

will convert to

RDD B

(1, {1->Set(3,4,5), 7->Set(10,11)}), (2, {3->Set(5,6)})

I am not able to formulate a function for this in Scala as I am new to the language. Any help would be appreciated.

Thanks in advance.

Alper t. Turker · Accepted Answer · 2017-07-27T11:08:33

The cats Semigroup is a good fit here: its instance for Map combines the values of matching keys, and its instance for Set is union. Add

spark.jars.packages org.typelevel:cats_2.11:0.9.0

to the Spark configuration and use the combine method:

import cats.implicits._

val rdd = sc.parallelize(Seq(
  (1, Map(1 -> Set(3, 5))),
  (2, Map(3 -> Set(5, 6))),
  (1, Map(1 -> Set(3, 4), 7 -> Set(10, 11)))
))

// Merge the maps per key; sets under the same inner key are unioned
rdd.reduceByKey(_ combine _)
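If adding a dependency is not an option, the same merge can be written by hand. The following is only a minimal sketch in plain Scala: the names MergeSketch and mergeMaps are made up for illustration and are not part of cats or Spark, and the local groupBy stands in for reduceByKey so the logic can be checked without a cluster.

```scala
object MergeSketch {
  // Hypothetical helper: merge two maps, unioning the sets under shared keys.
  def mergeMaps[K, V](a: Map[K, Set[V]], b: Map[K, Set[V]]): Map[K, Set[V]] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Set.empty[V]) ++ v)
    }

  // Same sample data as in the question, held in a local Seq instead of an RDD.
  val data: Seq[(Int, Map[Int, Set[Int]])] = Seq(
    (1, Map(1 -> Set(3, 5))),
    (2, Map(3 -> Set(5, 6))),
    (1, Map(1 -> Set(3, 4), 7 -> Set(10, 11)))
  )

  // Local stand-in for reduceByKey: group by key, then fold the maps together.
  val merged: Map[Int, Map[Int, Set[Int]]] =
    data.groupBy(_._1).map { case (k, pairs) =>
      k -> pairs.map(_._2).reduce((a, b) => mergeMaps(a, b))
    }

  // With Spark, assuming an RDD named rdd with the same element type:
  //   rdd.reduceByKey((a, b) => MergeSketch.mergeMaps(a, b))
}
```

The function is associative and commutative for this data shape, which is what reduceByKey requires of its argument.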
