0 votes

I have the following simple code in Java. It creates a Map and fills it with 0 values.

Map<Integer,Integer> myMap = new HashMap<Integer,Integer>();
for (int i=0; i<=20; i++) { myMap.put(i, 0); }

I want to create a similar RDD using Spark and Scala. I tried the approach below, but it gives me RDD[(Any) => (Any,Int)] instead of RDD[Map[Int,Int]]. What am I doing wrong?

val data = (0 to 20).map(_ => (_,0))
val myMapRDD = sparkContext.parallelize(data)

3 Answers

2 votes

In Scala, (0 to 20).map(_ => (_, 0)) does not compile, as its placeholder syntax is invalid. I believe you might be looking for something like this instead:

val data = (0 to 20).map( _->0 )

which would generate a list of key-value pairs, and is really just a placeholder shorthand for:

val data = (0 to 20).map( n => n->0 )

// data: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector(
//   (0,0), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0), (9,0), (10,0),
//   (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0), (19,0), (20,0)
// )

An RDD is an immutable collection of data (much like a Seq or Array, but distributed). To create an RDD of Map[Int,Int], you would expand data inside a Map, which in turn gets placed inside a Seq collection:

val rdd = sc.parallelize(Seq(Map(data: _*)))

rdd.collect
// res1: Array[scala.collection.immutable.Map[Int,Int]] = Array(
//   Map(0 -> 0, 5 -> 0, 10 -> 0, 14 -> 0, 20 -> 0, 1 -> 0, 6 -> 0, 9 -> 0, 13 -> 0, 2 -> 0, 17 -> 0,
//       12 -> 0, 7 -> 0, 3 -> 0, 18 -> 0, 16 -> 0, 11 -> 0, 8 -> 0, 19 -> 0, 4 -> 0, 15 -> 0)
// )

Note that, as is, this RDD consists of only a single Map; you can certainly assemble as many Maps as you wish in an RDD:

val rdd2 = sc.parallelize(Seq(
  Map((0 to 4).map( _->0 ): _*),
  Map((5 to 9).map( _->0 ): _*),
  Map((10 to 14).map( _->0 ): _*),
  Map((15 to 19).map( _->0 ): _*)
))
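
As a quick usage check (a small sketch, reusing the same sc handle), counting the elements confirms that rdd2 now holds four separate Maps:

rdd2.count
// Long = 4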

2 votes

What you are creating are tuples. Instead, you need to create Maps and parallelize them as below:

val data = (0 to 20).map(x => Map(x -> 0))
// data: scala.collection.immutable.IndexedSeq[scala.collection.immutable.Map[Int,Int]] = Vector(
//   Map(0 -> 0), Map(1 -> 0), Map(2 -> 0), Map(3 -> 0), Map(4 -> 0), Map(5 -> 0), Map(6 -> 0),
//   Map(7 -> 0), Map(8 -> 0), Map(9 -> 0), Map(10 -> 0), Map(11 -> 0), Map(12 -> 0), Map(13 -> 0),
//   Map(14 -> 0), Map(15 -> 0), Map(16 -> 0), Map(17 -> 0), Map(18 -> 0), Map(19 -> 0), Map(20 -> 0)
// )

val myMapRDD = sparkContext.parallelize(data)
// myMapRDD: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = ParallelCollectionRDD[0] at parallelize at test.sc:19
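
If you ultimately want a single filled Map back on the driver, as in the original Java snippet, one possible follow-up (a sketch, assuming the data stays this small) is to merge the per-element Maps:

// merge the single-entry Maps into one Map on the driver (fine for small data like this)
val merged: Map[Int, Int] = myMapRDD.reduce(_ ++ _)
// merged contains 0 -> 0, 1 -> 0, ..., 20 -> 0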

1 vote

You can't parallelize a Map, as parallelize takes a Seq. What you can achieve is creating an RDD[(Int, Int)], which however does not enforce the uniqueness of keys. To perform operations by key, you can leverage PairRDDFunctions, which, despite this limitation, can end up being useful for your use case.

Let's at least try to get an RDD[(Int, Int)].

You used slightly "wrong" syntax when mapping over your range.

The _ placeholder can have different meanings depending on the context. The two meanings that got mixed up in your snippet of code are:

  • a placeholder for an anonymous function parameter that is not going to be used (as in (_ => 42), a function which ignores its input and always returns 42)
  • a positional placeholder for arguments in anonymous functions (as in (_, 42), a function that takes one argument and returns a tuple whose first element is the input and whose second is the number 42)

The above examples are simplified and do not account for type inference; they only aim to point out the two meanings of the _ placeholder that got mixed up in your snippet of code. The short snippet below illustrates both.
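
For illustration only, here is a minimal (hypothetical) snippet with explicit type annotations, so the expansion of each placeholder is unambiguous:

val alwaysFortyTwo: Int => Int          = _ => 42   // ignores its argument, always returns 42
val pairWithFortyTwo: Int => (Int, Int) = (_, 42)   // uses its argument as the first tuple element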

The first step is to use one of the two following functions to create the pairs that are going to be part of the map, either

a => (a, 0)

or

(_, 0)

and after parallelizing it you can get the RDD[(Int, Int)], as follows:

val pairRdd = sc.parallelize((0 to 20).map((_, 0)))

I believe it's worth noting here that mapping over the local collection is executed eagerly and bound to your driver's resources, while you can obtain the same final result by parallelizing the collection first and then mapping the pair-creating function over the RDD, as sketched below.
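
A sketch of that alternative, assuming the same sc handle; here map is an RDD transformation, so it is evaluated lazily on the executors rather than eagerly on the driver:

// parallelize the range first, then build the pairs on the cluster
val pairRddLazy = sc.parallelize(0 to 20).map((_, 0))   // same contents as pairRdd above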

Now, as mentioned, you don't have a distributed map, but rather a collection of key-value pairs where key uniqueness is not enforced. You can still work pretty seamlessly with those values using PairRDDFunctions, which you obtain automatically by importing org.apache.spark.rdd.RDD.rddToPairRDDFunctions (or without doing anything at all in the spark-shell, where the import has already been done for you), and which decorates your RDD by leveraging Scala's implicit conversions.

import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

pairRdd.mapValues(_ + 1).foreach(println)

will print the following (output order may vary, since the data is distributed across partitions):

(10,1)
(11,1)
(12,1)
(13,1)
(14,1)
(15,1)
(16,1)
(17,1)
(18,1)
(19,1)
(20,1)
(0,1)
(1,1)
(2,1)
(3,1)
(4,1)
(5,1)
(6,1)
(7,1)
(8,1)
(9,1)

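Other pair operations become available the same way; for instance, a sketch using reduceByKey to sum the values per key (all zeros here, so every key stays at 0):

pairRdd.reduceByKey(_ + _).collect()
// Array((0,0), (1,0), ..., (20,0)) -- ordering across partitions is not guaranteed
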
You can learn more about working with key-value pairs with the RDD API in the official documentation.