You can't `parallelize` a `Map`, as `parallelize` takes a `Seq`. What you can achieve is creating an `RDD[(Int, Int)]`, which however does not enforce the uniqueness of keys. To perform operations by key, you can leverage `PairRDDFunctions`, which despite this limitation can end up being useful for your use case.
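For reference, this is why a plain Scala `Map` would first need to be turned into a sequence of pairs; the snippet below is only an illustrative sketch and assumes a `SparkContext` named `sc` is already available (as in the `spark-shell`):

```scala
// Hypothetical local map; Map is not a Seq, so convert it before parallelizing
val localMap = Map(1 -> 0, 2 -> 0, 3 -> 0)

val rddFromMap = sc.parallelize(localMap.toSeq) // RDD[(Int, Int)]
```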
Let's try at least to get an `RDD[(Int, Int)]`.
You used a slightly "wrong" syntax when applying the `map` to your range. The `_` placeholder can have different meanings depending on the context. The two meanings that got mixed up in your snippet of code are:

- a placeholder for an anonymous function parameter that is not going to be used (as in `_ => 42`, a function which ignores its input and always returns 42)
- a positional placeholder for arguments in anonymous functions (as in `(_, 42)`, a function that takes one argument and returns a tuple whose first element is the input and whose second is the number 42)

The above examples are simplified and do not account for type inference; they only aim to point out the two meanings of the `_` placeholder that got mixed up in your snippet of code. Both are illustrated in the short sketch below.
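To make the distinction concrete, here is a small, hypothetical illustration of both usages, with the types written out explicitly since type inference is deliberately left out of the discussion above:

```scala
// 1. `_` as an ignored parameter: the input is discarded, 42 is always returned
val ignoreInput: Int => Int = _ => 42
ignoreInput(7) // 42

// 2. `_` as a positional placeholder: the single argument fills the first slot of the tuple
val pairWith42: Int => (Int, Int) = (_, 42)
pairWith42(7) // (7, 42)
```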
The first step is to use one of the two following functions to create the pairs that are going to be part of the map, either `a => (a, 0)` or `(_, 0)`, and after parallelizing it you can get the `RDD[(Int, Int)]` as follows:
```scala
val pairRdd = sc.parallelize((0 to 20).map((_, 0)))
```
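For completeness, here is the equivalent using the named-parameter form mentioned above; this is just an illustrative variant, not something your code needs:

```scala
// Same result, spelling out the parameter instead of using the placeholder
val pairRddNamed = sc.parallelize((0 to 20).map(a => (a, 0)))
```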
I believe it's worth noting here that mapping on the local collection is going to be executed eagerly and bound to your driver's resources, while you can obtain the same final result by parallelizing the collection first and then mapping the pair-creating function on the `RDD`, as sketched below.
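A minimal sketch of that alternative, assuming the same `sc` and range as above, could look like this:

```scala
// Distribute the plain range first, then build the pairs lazily on the executors
val pairRddDistributed = sc.parallelize(0 to 20).map((_, 0)) // RDD[(Int, Int)]
```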
Now, as mentioned, you don't have a distributed map but rather a collection of key-value pairs where key uniqueness is not enforced. You can still work pretty seamlessly with those values using `PairRDDFunctions`, which you obtain automatically by importing `org.apache.spark.rdd.RDD.rddToPairRDDFunctions` (or without having to do anything in the `spark-shell`, as the import has already been done for you), which decorates your `RDD` by leveraging Scala's implicit conversions.
```scala
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

pairRdd.mapValues(_ + 1).foreach(println)
```
will print the following:

```
(10,1)
(11,1)
(12,1)
(13,1)
(14,1)
(15,1)
(16,1)
(17,1)
(18,1)
(19,1)
(20,1)
(0,1)
(1,1)
(2,1)
(3,1)
(4,1)
(5,1)
(6,1)
(7,1)
(8,1)
(9,1)
```
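Since key uniqueness is not enforced, duplicate keys have to be handled explicitly when you need map-like semantics. As a hedged sketch (with a made-up input containing a duplicate key), one of the by-key operations from `PairRDDFunctions`, such as `reduceByKey`, could be used to combine them:

```scala
// Hypothetical input with a duplicate key, to show how values are combined per key
val withDuplicates = sc.parallelize(Seq((1, 10), (2, 20), (1, 5)))

// reduceByKey merges the values of identical keys, yielding one pair per distinct key
withDuplicates.reduceByKey(_ + _).collect() // Array((1,15), (2,20)) -- order may vary
```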
You can learn more about working with key-value pairs with the RDD API in the official documentation.