We are building rather complex Dataflow jobs that compute models from a streaming source. In particular, we have two models that share a number of metrics and are computed from roughly the same data source. The jobs perform joins on fairly large datasets.

Do you have any guidelines on how to design this kind of job? Are there any metrics, behaviors, or other factors we should consider in order to make the decision?

Here are a couple of options that we have in mind and how we think they compare:

Option 1: one large job

Implement everything in one, large job. Factor common metrics, and then compute model specific metrics.

Pros

  • Simpler to write.
  • No dependency between jobs.
  • Fewer compute resources?

Cons

  • If one part breaks, neither model can be computed.

Option 2: Multiple jobs piped with Pub/Sub

Extract out the common metrics computation to a dedicated job, thus resulting in 3 jobs, wired together using Pub/Sub.

Pros

  • More resilient if one of the model jobs fails.
  • Probably easier to perform ongoing updates.

Cons

  • All jobs must be running for the full pipeline to work, which adds dependency management.

1 Answer

You've already mentioned many of the key tradeoffs here -- modularity and smaller failure domains vs. operational overhead and the potential complexity of a monolithic system. Another point to be aware of is cost -- the Pub/Sub traffic will increase the price of the multiple pipelines solution.
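To get a rough feel for that extra cost, you can estimate the additional Pub/Sub volume before committing to the split. The sketch below assumes that each message is counted once when the common-metrics job publishes it and once more for every model job that subscribes to it. The function names, message rate, message size, and price per TiB are all hypothetical placeholders, so check the current Pub/Sub pricing before relying on the result.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month
BYTES_PER_TIB = 2 ** 40


def extra_pubsub_tib_per_month(msgs_per_sec, bytes_per_msg, subscribers):
    """Estimate the monthly Pub/Sub volume (in TiB) added by splitting the job.

    Assumption: every message counts once on publish and once per
    subscribing job on delivery.
    """
    published_bytes = msgs_per_sec * bytes_per_msg * SECONDS_PER_MONTH
    counted_bytes = published_bytes * (1 + subscribers)
    return counted_bytes / BYTES_PER_TIB


def extra_cost(msgs_per_sec, bytes_per_msg, subscribers, price_per_tib):
    """Multiply the estimated volume by a caller-supplied price per TiB."""
    volume = extra_pubsub_tib_per_month(msgs_per_sec, bytes_per_msg, subscribers)
    return volume * price_per_tib


if __name__ == "__main__":
    # Hypothetical load: 1,000 messages/s of 1 KiB each, two model jobs.
    tib = extra_pubsub_tib_per_month(1000, 1024, subscribers=2)
    print(f"Estimated extra volume: {tib:.2f} TiB/month")
```

If the estimate is small relative to the Dataflow worker cost, the resilience of separate jobs is probably worth it; if the common metrics are high-volume, aggregating them before publishing would reduce the volume.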

Without knowing the specifics of your operation better, my advice would be to go with option #2. It sounds like there is at least partial value in having a subset of the models up, and in the event of a critical bug or regression, you'll be able to make partial progress while looking for a fix.