
I am running a Dataflow pipeline to parse about 45,000 text files stored in a Cloud Storage bucket. The parsed text is transformed into JSON and written to text files for subsequent loading into BigQuery (not part of the pipeline). A few minutes after the pipeline starts, the target number of workers rises above 30 (the exact number varies slightly between runs), but the number of actual workers remains stuck at 1.

Things I have checked:

  • There are no quota limitations (checked through the console and the job logs)
  • Autoscaling is enabled
  • The load of the single worker is around 80% (the Cloud Dataflow documentation mentions that autoscaling is disabled if the load of a single worker is below 5%)

If I let the pipeline run, it finishes successfully in about 2 hours, but I would expect it to run much faster if the actual workers scaled up to the target.

Here is the relevant section of the code:

from google.cloud import storage
import apache_beam as beam
from apache_beam.io import WriteToText

# List every object in the bucket; each blob name becomes one element of the pipeline.
client = storage.Client()
blobs = client.list_blobs(bucket_name)

rf = [b.name for b in blobs]

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))

Any suggestion as to what is preventing the workers from scaling up?


1 Answer


The issue here is that there is no parallelism available in this pipeline: the Create transform is single-sharded, and everything else in the pipeline is being fused together with it, so all of the work ends up on one worker. Using one of the built-in file-reading transforms such as ReadFromText will solve this, or you can put a Reshuffle transform after the Create to break the pipeline into two separate stages (see the sketch below).
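
As a minimal sketch of the Reshuffle approach, assuming the same getstring/getjson helpers, pipeline_options, rf file list, and known_args from the question:

import apache_beam as beam
from apache_beam.io import WriteToText

with beam.Pipeline(options=pipeline_options) as p:
    json_list = (p | 'Create filelist' >> beam.Create(rf)
                   # Reshuffle breaks fusion: the file names produced by the
                   # single-sharded Create are redistributed across workers,
                   # so the downstream Map/Filter steps can run in parallel.
                   | 'Reshuffle' >> beam.Reshuffle()
                   | 'Get string' >> beam.Map(getstring)
                   | 'Filter empty strings' >> beam.Filter(lambda x: x != "")
                   | 'Get JSON' >> beam.Map(getjson)
                   | 'Write output' >> WriteToText(known_args.output))

Alternatively, if the files can be processed line by line rather than as whole objects, ReadFromText with a glob pattern over the bucket could replace the Create/getstring steps entirely, since that source is splittable across workers.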