We use Scala parallel collections extensively inside Spark's rdd.mapPartitions(...). It works very well for us; we are able to scale IO-intensive jobs (calls to Redis, HBase, etc.) with ease.
BIG WARNING: Scala parallel collections are not lazy! Constructing the parallel collection pulls every row from the Iterator[Row] into memory up front. We use this mostly in a Spark Streaming context, so it is not an issue for us, but it is a problem when we want to process, say, a huge HBase table with Spark.
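A quick way to see the eagerness for yourself (a minimal sketch against Scala 2.12; on 2.13+ parallel collections live in the separate scala-parallel-collections module and need import scala.collection.parallel.CollectionConverters._):

// Illustrative only: the println side effect fires while .par copies the data,
// before any map is ever applied.
val source = Iterator.tabulate(5) { i => println(s"pulled $i"); i }
val pit = source.toIterable.par // all five "pulled ..." lines have printed by this point
val doubled = pit.map(_ * 2)    // map runs over data that is already fully in memory

With that caveat in mind, here is the pattern we use: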
import org.apache.spark.sql.Row
import scala.collection.parallel.ExecutionContextTaskSupport
import scala.concurrent.ExecutionContext

private def doStuff(rows: Iterator[Row]): Iterator[Row] = {
  val pit = rows.toIterable.par // materializes the whole partition (see the warning above)
  pit.tasksupport = new ExecutionContextTaskSupport(ExecutionContext.fromExecutor(...)) // plug in your dedicated Executor here
  pit.map(row => transform(row)).toIterator
}
rdd.mapPartitions(doStuff)
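For the huge-table case, one workaround we can sketch (illustrative, not our production code: doStuffChunked and batchSize are hypothetical names, and transform is the same per-row function as above) is to run the parallel map over fixed-size batches via grouped, so only one batch is ever materialized at a time:

import java.util.concurrent.Executors
import org.apache.spark.sql.Row
import scala.collection.parallel.ExecutionContextTaskSupport
import scala.concurrent.ExecutionContext

private def doStuffChunked(rows: Iterator[Row], batchSize: Int): Iterator[Row] = {
  // One pool per partition call for simplicity; real code would share it (e.g. a
  // singleton like the one sketched below) rather than leak threads as this does.
  val ts = new ExecutionContextTaskSupport(
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))) // size is a placeholder
  rows.grouped(batchSize).flatMap { batch =>
    val pit = batch.par // batch is a Seq, so .par copies only batchSize rows
    pit.tasksupport = ts
    pit.map(row => transform(row)).seq
  }
}

rdd.mapPartitions(doStuffChunked(_, 1000))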
We use ExecutionContextTaskSupport to run all computations on a dedicated thread pool instead of the default JVM-level ForkJoinPool.
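Concretely, the wiring elided in doStuff above can look like this (a sketch; PartitionPool is a hypothetical name and the pool size is a placeholder to tune for your IO latency). A lazy val in an object gives one pool per executor JVM, created on first use, which also avoids trying to serialize the pool in the Spark closure:

import java.util.concurrent.Executors
import scala.collection.parallel.ExecutionContextTaskSupport
import scala.concurrent.ExecutionContext

// Hypothetical executor-side singleton: one dedicated fixed-size pool per JVM.
object PartitionPool {
  lazy val taskSupport = new ExecutionContextTaskSupport(
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))) // 32 is illustrative
}

// inside doStuff:
// pit.tasksupport = PartitionPool.taskSupport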