2 votes

What happens if I use Scala parallel collections within a Spark job? (Parallel collections typically spawn tasks to process partitions of the collection on multiple threads.) Or, for that matter, a job that potentially starts sub-threads?

Does Spark's JVM limit execution to a single core, or can it sensibly distribute the work across many cores (presumably on the same node)?
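
To make the scenario concrete, here is a minimal sketch of the pattern being asked about (my own illustration, not from the question): a Scala parallel collection used inside a Spark transformation, so each task may fan out onto extra threads within the executor JVM. The RDD and the per-element work are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

// A parallel collection inside a Spark transformation: the inner .par map
// runs on the JVM's default ForkJoin pool within each executor task.
val spark = SparkSession.builder().appName("par-inside-task").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000)
val sums = rdd.map { n =>
  (1 to n).par.map(i => i.toLong * i).sum
}.collect()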


1 Answer

6 votes

We use Scala parallel collections extensively inside Spark's rdd.mapPartitions(...) function. It works perfectly for us; we are able to scale IO-intensive jobs very well (calling Redis/HBase/etc.).

BIG WARNING: Scala parallel collections are not lazy! When you construct a parallel collection from an iterator, it actually brings all rows from the Iterator[Row] into memory. We use it mostly in a Spark Streaming context, so that is not an issue for us, but it is a problem when we want, for example, to process a huge HBase table with Spark (a chunked workaround is sketched after the example below).

import scala.collection.parallel.ExecutionContextTaskSupport
import scala.concurrent.ExecutionContext
import org.apache.spark.sql.Row

private def doStuff(rows: Iterator[Row]): Iterator[Row] = {
    // NOTE: .toIterable.par materializes the whole partition (see warning above)
    val pit = rows.toIterable.par
    pit.tasksupport = new ExecutionContextTaskSupport(ExecutionContext.fromExecutor(...))
    pit.map(row => transform(row)).toIterator
}

rdd.mapPartitions(doStuff)
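
If the laziness warning above is a concern, one possible workaround, a rough sketch rather than anything from the original answer, is to walk the partition in fixed-size chunks so only one chunk is materialized at a time. The chunk size of 1000 is an arbitrary assumption, and transform is the same per-row function used above.

private def doStuffChunked(rows: Iterator[Row]): Iterator[Row] =
  rows.grouped(1000).flatMap { chunk =>
    // only this chunk is held in memory; the same tasksupport trick from
    // doStuff can be applied to chunk.par if a dedicated pool is needed
    chunk.par.map(row => transform(row)).seq
  }

rdd.mapPartitions(doStuffChunked)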

We use ExecutionContextTaskSupport to put all computations into a dedicated thread pool instead of using the default JVM-level ForkJoinPool.
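
For completeness, the elided executor in ExecutionContext.fromExecutor(...) above could be something like a fixed-size pool. The size of 32 below is purely an assumption and should be tuned to how much concurrency the external service (Redis/HBase/etc.) tolerates per executor.

import java.util.concurrent.Executors
import scala.collection.parallel.ExecutionContextTaskSupport
import scala.concurrent.ExecutionContext

// Hypothetical dedicated pool for the IO-bound calls (32 threads is an assumed size)
val ioPool = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))

// wired into the parallel collection in place of the default ForkJoinPool:
// pit.tasksupport = new ExecutionContextTaskSupport(ioPool)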