I have a Cloud Dataflow pipeline that looks like this:
- Read From Cloud Bigtable
- Do some transformation
- Write to GCS
Initially, without setting any maximum number of workers, Dataflow autoscaling would scale up to the maximum (1,000 workers) and put a LOT of stress on our Bigtable cluster. I then set maxNumWorkers to, say, 100, which works fine: it no longer puts crazy load on our Bigtable cluster, and stage 1 (reading from Bigtable) usually finishes quickly; but stages 2 and 3, with only 100 workers, take significantly longer. Is there any way I can change maxNumWorkers dynamically after the first stage? I see apply(Wait.on) but am not sure how to utilize it. My Beam job looks like this:
pipeline
    .apply("Read Bigtable", ...)
    .apply("Transform", ...)
    .apply("Partition&Write", ...);
I am looking for a way to wait for .apply("Read Bigtable", ...) to finish and then increase maxNumWorkers. Essentially, my first stage is I/O-bound and doesn't need many workers (CPU), but my later stages are CPU-bound and need more of them.
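For what it's worth, this is the shape I have been experimenting with for Wait.on. The project/instance/table IDs and the transform/write bodies are just placeholders for my real logic, and I'm not sure this influences worker allocation at all:

import com.google.bigtable.v2.Row;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// "Read Bigtable": project/instance/table IDs below are placeholders.
PCollection<Row> rows = pipeline.apply("Read Bigtable",
    BigtableIO.read()
        .withProjectId("my-project")
        .withInstanceId("my-instance")
        .withTableId("my-table"));

// Wait.on(rows) should hold back the downstream steps until the read is done,
// but as far as I can tell it does not change how many workers they get.
rows.apply("Wait for read", Wait.on(rows))
    .apply("Transform", MapElements
        .into(TypeDescriptors.strings())
        .via((Row row) -> row.getKey().toStringUtf8()))
    .apply("Partition&Write", TextIO.write().to("gs://my-bucket/output"));

Even if this forces the transform to start only after the read completes, I still don't see a way to raise the worker cap at that point in the job.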