
I am running a Dataflow pipeline to parse about 45,000 text files stored in a Cloud Storage bucket. The parsed text is transformed into JSON and written to text files for subsequent loading into BigQuery (not part of the pipeline). A few minutes after the pipeline starts, the target number of workers rises above 30 (the exact number varies slightly between runs), but the number of actual workers remains stuck at 1.

Things I have checked:

  • There are no quota limitations (checked through the console and the job logs)
  • Autoscaling is enabled
  • The load of the single worker is around 80% (the Cloud Dataflow documentation mentions that autoscaling is disabled if the load of a single worker is below 5%)

If I let the pipeline run, it finishes successfully in about 2 hours, but I would expect it to run much faster if the actual workers scaled up to the target.

Here is the relevant section of the code:

from google.cloud import storage
import apache_beam as beam
from apache_beam.io import WriteToText

# List every object in the bucket; each blob name becomes one element of the pipeline.
client = storage.Client()
blobs = client.list_blobs(bucket_name)

rf = [b.name for b in blobs]

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))

Any suggestion as to what is preventing the workers from scaling up?


1 Answer


The issue here is that there is no parallelism available in this pipeline: the Create transform is single-sharded, and everything else in the pipeline is being fused together with it, so all of the work ends up on one worker. Using one of the built-in file-reading transforms such as ReadFromText will solve this, or you can put a Reshuffle transform after the Create to break the pipeline into two separate stages (see the sketch below).
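
As a minimal sketch of the Reshuffle approach, assuming the same getstring/getjson helpers, pipeline_options, rf file list, and known_args from the question:

import apache_beam as beam
from apache_beam.io import WriteToText

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   # Reshuffle breaks fusion: the file names produced by the
                   # single-sharded Create are redistributed across workers,
                   # so the downstream Map/Filter steps can run in parallel.
                   | 'Reshuffle' >> beam.Reshuffle()
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))

Alternatively, if the files can be processed line by line rather than as whole objects, ReadFromText with a glob pattern over the bucket could replace the Create/getstring steps entirely, since that source is splittable across workers.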