I'm currently using Google Dataflow with Python for batch processing. This works fine; however, I'd like to get a bit more speed out of my Dataflow jobs without having to deal with Java.
Using the Go SDK, I've implemented a simple pipeline that reads a series of 100-500 MB files from Google Cloud Storage (using textio.Read), does some aggregation, and updates CloudSQL with the results. The number of files being read can range from dozens to hundreds.
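For reference, the pipeline is roughly shaped like the sketch below. It's simplified: the bucket glob, extractKey and logResult are placeholders standing in for my real parsing, aggregation and CloudSQL code, and I'm on the pre-v2 import path.

```go
package main

import (
	"context"
	"flag"
	"log"
	"strings"

	"github.com/apache/beam/sdks/go/pkg/beam"
	"github.com/apache/beam/sdks/go/pkg/beam/io/textio"
	"github.com/apache/beam/sdks/go/pkg/beam/transforms/stats"
	"github.com/apache/beam/sdks/go/pkg/beam/x/beamx"
)

func init() {
	beam.RegisterFunction(extractKey)
	beam.RegisterFunction(logResult)
}

// extractKey stands in for my real parsing: it takes the first
// comma-separated field of each line as the aggregation key.
func extractKey(line string) string {
	return strings.SplitN(line, ",", 2)[0]
}

// logResult stands in for the CloudSQL update; here it only logs the
// aggregated counts.
func logResult(key string, count int) {
	log.Printf("key=%s count=%d", key, count)
}

func main() {
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()

	// Read every matching file from GCS (the glob is a placeholder).
	lines := textio.Read(s, "gs://my-bucket/input/*.csv")

	// Simple aggregation: count lines per key.
	keys := beam.ParDo(s, extractKey, lines)
	counts := stats.Count(s, keys)

	// In the real job this step writes the results to CloudSQL.
	beam.ParDo0(s, logResult, counts)

	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatalf("pipeline failed: %v", err)
	}
}
```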
When I run the pipeline, I can see from the logs that the files are being read serially rather than in parallel, and as a result the job takes much longer. The same process executed with the Python SDK triggers autoscaling and runs multiple reads within minutes.
I've tried specifying the number of workers using --num_workers=; however, Dataflow scales the job back down to one instance after a few minutes, and the logs show no parallel reads taking place while the job is running.
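For context, I'm launching the job with roughly the following (project, bucket and worker count are placeholders, and a few other required flags are trimmed for brevity):

```
go run ./pipeline.go \
  --runner=dataflow \
  --project=my-project \
  --staging_location=gs://my-bucket/staging \
  --num_workers=10
```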
Something similar happens if I remove textio.Read and implement a custom DoFn for reading from GCS: the reads still run serially.
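The custom reader version looks roughly like the sketch below. It's simplified: the path parsing, the per-element client creation, and the way the file list is built are placeholders for my real code.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"strings"

	"cloud.google.com/go/storage"
	"github.com/apache/beam/sdks/go/pkg/beam"
)

func init() {
	beam.RegisterFunction(readGCSFile)
}

// readGCSFile is the custom reader DoFn: given a gs://bucket/object path,
// it emits the file's lines one by one.
func readGCSFile(ctx context.Context, gcsPath string, emit func(string)) error {
	// Split "gs://bucket/object" into bucket and object names.
	trimmed := strings.TrimPrefix(gcsPath, "gs://")
	parts := strings.SplitN(trimmed, "/", 2)
	if len(parts) != 2 {
		return fmt.Errorf("invalid GCS path: %s", gcsPath)
	}

	// Creating a client per element is wasteful but keeps the sketch short.
	client, err := storage.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	r, err := client.Bucket(parts[0]).Object(parts[1]).NewReader(ctx)
	if err != nil {
		return err
	}
	defer r.Close()

	sc := bufio.NewScanner(r)
	for sc.Scan() {
		emit(sc.Text())
	}
	return sc.Err()
}

// buildReadStep wires the reader into the pipeline: the list of file paths
// becomes a PCollection, and the reader DoFn is applied to each path.
func buildReadStep(s beam.Scope, fileNames []string) beam.PCollection {
	files := beam.CreateList(s, fileNames)
	return beam.ParDo(s, readGCSFile, files)
}
```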
I'm aware that the current Go SDK is experimental and lacks many features; however, I haven't found a direct reference to limitations around parallel processing here. Does the current incarnation of the Go SDK support parallel processing on Dataflow?
Thanks in advance