We are building rather complex Dataflow jobs that compute models from a streaming source. In particular, we have two models that share a number of metrics and are computed from roughly the same data source. The jobs perform joins on fairly large datasets.
Do you have any guidelines on how to design this kind of job? Are there any metrics, behaviors, or other factors we should consider in order to make the decision?
Here are a couple of options we have in mind and how we think they compare:
Option 1: one large job
Implement everything in a single, large job: compute the common metrics once, then compute the model-specific metrics (a rough sketch follows the pros/cons below).
Pros
- Simpler to write.
- No dependency between jobs.
- Fewer compute resources?
Cons
- If one part breaks, neither model can be computed.
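
To illustrate what we mean by option 1, here is a minimal sketch using the Python Beam/Dataflow SDK. The transform classes, the subscription, and the table names are placeholders, not our real code:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# Placeholder transforms standing in for our real joins/aggregations.
class ComputeCommonMetrics(beam.PTransform):
    def expand(self, events):
        return events | "NoopCommon" >> beam.Map(lambda e: e)


class ComputeModelA(beam.PTransform):
    def expand(self, metrics):
        return metrics | "NoopA" >> beam.Map(lambda m: m)


class ComputeModelB(beam.PTransform):
    def expand(self, metrics):
        return metrics | "NoopB" >> beam.Map(lambda m: m)


def run():
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        events = (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/raw-events")
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8"))))

        # Common metrics are computed once; the resulting PCollection
        # fans out to both model branches inside the same job.
        common = events | "CommonMetrics" >> ComputeCommonMetrics()

        (common
         | "ModelA" >> ComputeModelA()
         | "WriteModelA" >> beam.io.WriteToBigQuery("my_project:metrics.model_a"))

        (common
         | "ModelB" >> ComputeModelB()
         | "WriteModelB" >> beam.io.WriteToBigQuery("my_project:metrics.model_b"))


if __name__ == "__main__":
    run()
```

The fan-out from `common` is what avoids recomputing the shared joins, but it also means a failure in either branch takes down the whole job.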
Option 2: Multiple jobs piped with Pub/Sub
Extract the common-metrics computation into a dedicated job, resulting in three jobs wired together with Pub/Sub (a rough sketch follows the pros/cons below).
Pros
- More resilient if one of the model jobs fails.
- Probably easier to perform ongoing updates.
Cons
- All jobs need to be running for the full pipeline to work, which introduces dependency management.
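
And a minimal sketch of option 2, again with placeholder transform classes, topics, subscriptions, and table names. The common-metrics job republishes to an intermediate topic, and each model job reads from its own subscription on that topic (only one model job is shown; the other would mirror it):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# Placeholder transforms standing in for our real joins/aggregations.
class ComputeCommonMetrics(beam.PTransform):
    def expand(self, events):
        return events | beam.Map(lambda e: e)


class ComputeModelA(beam.PTransform):
    def expand(self, metrics):
        return metrics | beam.Map(lambda m: m)


def run_common_metrics_job():
    """Job 1 of 3: compute the shared metrics, republish to an intermediate topic."""
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/raw-events")
         | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "CommonMetrics" >> ComputeCommonMetrics()
         | "Encode" >> beam.Map(lambda metric: json.dumps(metric).encode("utf-8"))
         | "PublishMetrics" >> beam.io.WriteToPubSub(
             topic="projects/my-project/topics/common-metrics"))


def run_model_a_job():
    """Job 2 of 3: one model job; job 3 of 3 mirrors this for the other model."""
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (p
         | "ReadMetrics" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/common-metrics-model-a")
         | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "ModelA" >> ComputeModelA()
         | "WriteModelA" >> beam.io.WriteToBigQuery("my_project:metrics.model_a"))
```

Since each model job has its own subscription on the common-metrics topic, one model job can be restarted or updated without touching the other, which is the resilience/update benefit listed above.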