
I have a heavily I/O-bound (Java) Beam pipeline. On Google Cloud Dataflow I use the pipeline option setNumberOfWorkerHarnessThreads(16) to get 16 threads running on every virtual CPU. I'm trying to port that same pipeline to Spark, and I can't find an equivalent option. I've tried doing my own threading, but that appears to cause problems on the SparkRunner: the ProcessElement part of the DoFn returns, but the output to the ProcessContext is called later, when the worker thread completes. (I get ConcurrentModificationExceptions with stack traces inside Beam rather than in user code.)
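For reference, this is roughly how I set that option on the Dataflow side (a sketch; it assumes the option is reached through DataflowPipelineOptions, which only the Dataflow runner honours):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class Main {
  public static void main(String[] args) {
    // Dataflow-specific: the setting lives on the Dataflow runner's options
    // interface, so other runners (including Spark) simply ignore it.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setNumberOfWorkerHarnessThreads(16);

    Pipeline pipeline = Pipeline.create(options);
    // ... build and run the pipeline ...
  }
}
```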

Is there an equivalent to that setting on Spark?


1 Answer


I'm not aware of an equivalent setting on Spark. If you want to do your own threading, you have to ensure that output is only ever called from the thread that invokes ProcessElement or FinishBundle. One way to do that is to start a thread pool that reads work from an input queue and writes results to an output queue: in ProcessElement, push the element onto the input queue and drain any completed results from the output queue to the context's output, then drain whatever remains in FinishBundle. A sketch of that pattern is below.
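A minimal sketch, assuming the input and output types are both String and the per-element work is a blocking call. EnrichFn, callService, and the pool size of 16 are placeholders. Instead of two explicit queues it keeps a single queue of Futures backed by an ExecutorService, but the important property is the same: output is only ever called on the thread Beam used to invoke ProcessElement or FinishBundle.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

// Hypothetical example DoFn; EnrichFn and callService are placeholders.
public class EnrichFn extends DoFn<String, String> {

  private transient ExecutorService pool;
  private transient Queue<Future<String>> pending;

  @Setup
  public void setup() {
    // Private pool for the blocking I/O; 16 is just an example size.
    pool = Executors.newFixedThreadPool(16);
  }

  @StartBundle
  public void startBundle() {
    pending = new ConcurrentLinkedQueue<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Capture the element on the Beam thread, then hand the blocking call to
    // the pool. The pool threads never touch the ProcessContext.
    String element = c.element();
    pending.add(pool.submit(() -> callService(element)));

    // Emit whatever has already completed, on the thread Beam called us from.
    Future<String> done;
    while ((done = pending.peek()) != null && done.isDone()) {
      pending.poll();
      c.output(done.get());
    }
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) throws Exception {
    // Block for the stragglers and emit them before the bundle is committed.
    // FinishBundleContext needs an explicit timestamp and window; the global
    // window is assumed here for simplicity.
    Future<String> done;
    while ((done = pending.poll()) != null) {
      c.output(done.get(), Instant.now(), GlobalWindow.INSTANCE);
    }
  }

  @Teardown
  public void teardown() {
    pool.shutdown();
  }

  // Stand-in for the real I/O-bound work.
  private String callService(String input) {
    return input;
  }
}
```

The queue of Futures plays the role of the output queue: ProcessElement never blocks on it, only FinishBundle does, so all results for a bundle are emitted before Beam commits that bundle.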