1 vote

I have a Dataproc cluster with one master and 4 workers, and this Spark job:

JavaRDD<Signal> rdd_data = javaSparkContext.parallelize(my_data, 8);

rdd_data.foreachPartition(partitionOfRecords -> {
    long count = 0L;
    while (partitionOfRecords.hasNext()) {
        partitionOfRecords.next();
        count++;
    }
    System.out.println("Items in partition: " + count);
});

Here my_data is an array with about 1000 elements. The job starts on the cluster in the right way and returns correct data, but it runs only on the master and not on the workers. I use Dataproc image 1.4 for every machine in the cluster.

Can anybody help me understand why this job runs only on the master?

I'm not able to easily reproduce this behavior. How did you determine that the tasks ran on the master? – Ben Sidhom
It would also be good to clarify whether this was submitted through the Dataproc jobs API or via a command-line call to spark-submit, and whether any extra Spark properties were specified (such as --master local[1], which would make the job use Spark's "local executor" instead of the actual cluster; see the example below). The machine type would be important to know too. For example, if the concern is really that only one node was used, but not specifically the master, it could be because the cluster runs n1-standard-8 workers, where each worker can hold 8 tasks, so all 8 partitions ran on one node. – Dennis Huo
If there were indeed no additional Spark properties added, then I'd suspect the job should just increase the number of partitions. – Dennis Huo
Sorry guys, it was my fault! A wrong Ctrl-C Ctrl-V set master local[1]. Now it works correctly. Thanks for your time! – Claudio
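
For reference, a jobs-API submission along these lines (cluster, class, and jar names are placeholders) does not override spark.master, so the job runs on YARN across the workers:

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --class=com.example.SignalJob \
    --jars=gs://my-bucket/signal-job.jar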

2 Answers

2 votes

There are two points of interest here:

  1. The println call inside foreachPartition will print the expected results only when the executor happens to be the same node as the client running the Spark program. This is because println writes to the stdout stream under the hood, which is accessible only on the machine where it runs, so messages printed on other nodes cannot be propagated back to the client program. (A driver-side alternative is sketched after this list.)
  2. When you set the master to local[1], you force Spark to run locally using a single thread, so Spark and the client program share the same stdout stream and you can see the program's output. It also means that the driver and the executor are the same node.
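
As a minimal sketch of that alternative (assuming the rdd_data from the question; the hostname lookup is just for illustration), you can ship each partition's count back to the driver with mapPartitions plus collect, so the output is visible in the client's stdout no matter where the executors run:

import java.net.InetAddress;
import java.util.Collections;
import java.util.List;

// Count each partition's records on the executor and tag the result
// with the executor's hostname.
List<String> reports = rdd_data.mapPartitions(partition -> {
    long count = 0L;
    while (partition.hasNext()) {
        partition.next();
        count++;
    }
    String host = InetAddress.getLocalHost().getHostName();
    return Collections.singletonList("host=" + host + " items=" + count).iterator();
}).collect();

// Runs on the driver, so the lines appear in the client's stdout.
reports.forEach(System.out::println);

With the cluster master in effect, the hostnames should show the partitions spread across the worker nodes rather than all landing on one machine.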
1 vote

I found master local[1] in the extra Spark config! Now it works correctly!
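
As a quick sanity check against this kind of stray property (a sketch, assuming the javaSparkContext from the question), you can print the effective master before running anything:

// Prints e.g. "yarn" on a real Dataproc cluster, or "local[1]" when a
// stray property from a bad copy-paste is still in effect.
System.out.println("spark.master = " + javaSparkContext.getConf().get("spark.master"));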