2 votes

I have a Cloud Dataflow pipeline that looks like this:

  1. Read From Cloud Bigtable
  2. Do some transformation
  3. Write to GCS

Initially, without setting any max workers, Dataflow autoscaling would scale up to the maximum (1,000 workers) and put a LOT of stress on our Bigtable cluster. I then set maxNumWorkers to 100, which keeps the load on Bigtable reasonable, and Stage 1 (reading from Bigtable) still finishes quickly; but Steps 2 and 3 with only 100 workers take significantly longer. Is there any way I can change maxNumWorkers dynamically after the first stage? I see apply(Wait.on) but am not sure how to utilize it. My Beam job looks like this:

pipeline
 .apply("Read Bigtable",...)
 .apply("Transform", ...)
 .apply("Partition&Write", ...)

I am looking for a way to wait for .apply("Read Bigtable",...) to finish and then increase maxNumWorkers. Essentially, my first stage is I/O bound and doesn't need much CPU (workers), but my later stages are CPU bound and need more CPU (workers).
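
For reference, this is roughly how I'm capping the workers today (a sketch, not my exact code); as far as I can tell maxNumWorkers is a launch-time option on the Dataflow runner, so it applies to the whole job:

 DataflowPipelineOptions options = PipelineOptionsFactory
     .fromArgs(args)
     .withValidation()
     .as(DataflowPipelineOptions.class);
 options.setMaxNumWorkers(100);   // caps autoscaling for the entire job, not per stage
 Pipeline pipeline = Pipeline.create(options);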

1
Good question, I would also like to know if task-level control over workers is possible. – manesioz

1 Answer

-1 votes

Did you try using file sharding to control the parallelism?

1) Keep maxNumWorkers at 1000.
2) Right after reading from Bigtable, stage the data to files with a sharding of 100.
3) Load the staged data again, do the processing, and write the final results with a sharding of 1000.

I cannot guarantee it will work, but it is worth a try.
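
A rough sketch of the idea using TextIO from the Beam Java SDK (the serialization step, project/bucket names, and the transform DoFn are placeholders, not from your actual pipeline):

 // Stage 1: read from Bigtable and stage the data to GCS with a small shard count
 PCollection<String> staged = pipeline
     .apply("Read Bigtable", BigtableIO.read()
         .withProjectId("my-project")        // placeholder IDs
         .withInstanceId("my-instance")
         .withTableId("my-table"))
     .apply("Serialize", MapElements.into(TypeDescriptors.strings())
         .via(row -> row.getKey().toStringUtf8()));   // placeholder serialization

 staged.apply("Stage to GCS", TextIO.write()
     .to("gs://my-bucket/staged/part")
     .withNumShards(100));                   // few shards while Bigtable is involved

 // Stage 2 (could be a separate job launched with a higher maxNumWorkers):
 // read the staged files back, do the CPU-heavy work, write with more shards
 pipeline2
     .apply("Read staged", TextIO.read().from("gs://my-bucket/staged/part*"))
     .apply("Transform", ParDo.of(new MyHeavyTransformFn()))   // placeholder DoFn
     .apply("Write results", TextIO.write()
         .to("gs://my-bucket/output/result")
         .withNumShards(1000));              // more shards for the final write

The shard count on the intermediate write is what bounds how much work Bigtable sees, while the second read/write can fan out to many more workers.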