
I'm currently using Google Dataflow with Python for batch processing. This works fine, but I'm interested in getting a bit more speed out of my Dataflow jobs without having to deal with Java.

Using the Go SDK, I've implemented a simple pipeline that reads a series of 100-500 MB files from Google Cloud Storage (using textio.Read), does some aggregation, and updates Cloud SQL with the results. The number of files being read can range from dozens to hundreds.
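
For reference, the pipeline is roughly shaped like this. It's a sketch, not the exact code: the bucket path and the per-key count are placeholders for my aggregation, and updateCloudSQL is a hypothetical stand-in for the real Cloud SQL upsert (import paths are the Beam Go module's current ones):

    package main

    import (
        "context"
        "flag"
        "strings"

        "github.com/apache/beam/sdks/v2/go/pkg/beam"
        "github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
        "github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
        "github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
    )

    func main() {
        flag.Parse()
        beam.Init()

        p := beam.NewPipeline()
        s := p.Root()

        // Read all matching files from GCS; this is the step that runs serially.
        lines := textio.Read(s, "gs://my-bucket/input/*.csv")

        // Placeholder aggregation: count records per key (first CSV column).
        keyed := beam.ParDo(s, func(line string) (string, int) {
            return strings.SplitN(line, ",", 2)[0], 1
        }, lines)
        counts := stats.SumPerKey(s, keyed)

        // Placeholder sink: upsert each aggregate into Cloud SQL.
        beam.ParDo0(s, func(ctx context.Context, key string, total int) {
            // updateCloudSQL(ctx, key, total) // hypothetical helper using database/sql
        }, counts)

        // Submitted to Dataflow with flags like --runner=dataflow --num_workers=5
        // (plus the usual project/region/staging flags).
        if err := beamx.Run(context.Background(), p); err != nil {
            panic(err)
        }
    }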

When I run the pipeline, I can see from the logs that the files are being read serially rather than in parallel, so the job takes much longer. The same process executed with the Python SDK triggers autoscaling and runs multiple reads within minutes.

I've tried specifying the number of workers with --num_workers=, but Dataflow scales the job down to one instance after a few minutes, and the logs show no parallel reads taking place while the extra workers were running.

Something similar happens if I remove textio.Read and implement a custom DoFn for reading from GCS (sketched below): the reads still run serially.
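
The custom DoFn was along these lines; readFile and buildReads are illustrative names of my own, and the SDK's filesystem package resolves the gs:// scheme. Even with one element per file, the reads were not spread across workers:

    package gcsread

    import (
        "bufio"
        "context"

        "github.com/apache/beam/sdks/v2/go/pkg/beam"
        "github.com/apache/beam/sdks/v2/go/pkg/beam/io/filesystem"
        _ "github.com/apache/beam/sdks/v2/go/pkg/beam/io/filesystem/gcs" // registers the gs:// scheme
    )

    // readFile opens one GCS object and emits it line by line.
    func readFile(ctx context.Context, filename string, emit func(string)) error {
        fd, err := filesystem.OpenRead(ctx, filename)
        if err != nil {
            return err
        }
        defer fd.Close()

        scanner := bufio.NewScanner(fd)
        for scanner.Scan() {
            emit(scanner.Text())
        }
        return scanner.Err()
    }

    // buildReads fans a static list of paths out to the reading DoFn.
    func buildReads(s beam.Scope, files []string) beam.PCollection {
        paths := beam.CreateList(s, files) // one element per file
        return beam.ParDo(s, readFile, paths)
    }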

I'm aware that the current Go SDK is experimental and lacks many features; however, I haven't found a direct reference to limitations around parallel processing. Does the current incarnation of the Go SDK support parallel processing on Dataflow?

Thanks in advance


2 Answers


I managed to find an answer to this after actually creating my own IO package for the Go SDK.

SplittableDoFns are not yet available in the Go SDK. This key piece of functionality is what allows the Python and Java SDKs to perform IO operations in parallel and thus run much faster than the Go SDK at scale.
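
For context, here's a conceptual sketch of what a splittable file read looks like, modeled on the sdf and offsetrange packages the Go SDK gains in later releases; treat it as an illustration rather than a working API, and note that fileSize is a hypothetical stub. The point is that the runner can split the restriction and hand sub-ranges of a single file to different workers, which is exactly the parallelism textio.Read is missing without SDFs:

    package gcsread

    import (
        "github.com/apache/beam/sdks/v2/go/pkg/beam/core/sdf"
        "github.com/apache/beam/sdks/v2/go/pkg/beam/io/rtrackers/offsetrange"
    )

    type splitReadFn struct{}

    // fileSize is a hypothetical stub; real code would stat the GCS object.
    func fileSize(filename string) int64 { return 0 }

    // CreateInitialRestriction covers the whole file as one byte range.
    func (fn *splitReadFn) CreateInitialRestriction(filename string) offsetrange.Restriction {
        return offsetrange.Restriction{Start: 0, End: fileSize(filename)}
    }

    // SplitRestriction breaks the range into independently schedulable chunks.
    func (fn *splitReadFn) SplitRestriction(filename string, rest offsetrange.Restriction) []offsetrange.Restriction {
        return rest.EvenSplits(4)
    }

    // RestrictionSize lets the runner estimate remaining work for rebalancing.
    func (fn *splitReadFn) RestrictionSize(filename string, rest offsetrange.Restriction) float64 {
        return rest.Size()
    }

    // CreateTracker wraps the restriction so progress can be claimed safely.
    func (fn *splitReadFn) CreateTracker(rest offsetrange.Restriction) *sdf.LockRTracker {
        return sdf.NewLockRTracker(offsetrange.NewTracker(rest))
    }

    // ProcessElement reads only the claimed sub-range of the file.
    func (fn *splitReadFn) ProcessElement(rt *sdf.LockRTracker, filename string, emit func(string)) error {
        // Seek to the restriction start, call rt.TryClaim(pos) before each
        // record, and emit lines until the claim fails.
        return nil
    }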