I am running a Dataflow pipeline to parse about 45,000 text files stored in a Cloud Storage bucket. The parsed text is transformed into JSON and written to text files for subsequent loading into BigQuery (not part of the pipeline). A few minutes after the pipeline starts, the target number of workers is raised to more than 30 (the exact number varies slightly between runs), but the actual number of workers remains stuck at 1.
Things I have checked:
- There are no quota limitations (checked through the console and the job logs)
- Autoscaling is enabled (a sketch of the pipeline options follows this list)
- The load of the single worker is around 80% (the Cloud Dataflow documentation mentions that autoscaling is disabled if the load of a single worker is below 5%)
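For completeness, the pipeline options are built roughly like this; the project, region, and paths below are placeholders, and the autoscaling-related settings are the ones that seem relevant:

from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                      # placeholder project id
    region='europe-west1',                     # placeholder region
    temp_location='gs://my-bucket/temp',       # placeholder temp path
    autoscaling_algorithm='THROUGHPUT_BASED',  # default for batch jobs
    max_num_workers=50,                        # cap on the target worker count
)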
If I let the pipeline run, it finishes successfully in about 2 hours, but I would expect it to run much faster if the actual worker count scaled up to the target.
Here is the relevant section of the code:
import apache_beam as beam
from apache_beam.io import WriteToText
from google.cloud import storage

# Build the list of file names to process from the bucket.
client = storage.Client()
blobs = client.list_blobs(bucket_name)
rf = [b.name for b in blobs]

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p
                 | 'Create filelist' >> beam.Create(rf)
                 | 'Get string' >> beam.Map(getstring)       # read one file's contents
                 | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                 | 'Get JSON' >> beam.Map(getjson)           # convert the parsed text to a JSON string
                 | 'Write output' >> WriteToText(known_args.output))  # known_args.output is a GCS path from argparse
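The two helpers look roughly like this in simplified form; the real parsing logic in getjson is more involved, and bucket_name is assumed to be defined at module level:

import json
from google.cloud import storage

def getstring(blob_name):
    # Download one file's contents as text; return "" so the Filter step
    # above can drop files that are empty or could not be read.
    # (A client is created per element here only to keep the sketch short.)
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    try:
        return blob.download_as_text()
    except Exception:
        return ""

def getjson(text):
    # Parse the raw text into fields and serialize them as one JSON line
    # (placeholder for the real parsing logic).
    return json.dumps({"raw": text})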
Any suggestions as to what is preventing the workers from scaling up?