We are building rather complex Dataflow jobs that compute models from a streaming source. In particular, we have two models that share a number of metrics and are computed from roughly the same data source. The jobs perform joins on fairly large datasets.
Do you have any guidelines on how to design this kind of job? Are there any metrics, behaviors, or other factors we should consider in order to make the decision?
Here are a couple of options we have in mind and how we think they compare:
Option 1: one large job
Implement everything in a single, large job: compute the common metrics once, then compute the model-specific metrics (a rough sketch follows the pros/cons below).
Pros
- Simpler to write.
- No dependency between jobs.
- Fewer compute resources?
Cons
- If one part breaks, neither model can be computed.
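
To illustrate what we mean by option 1, here is a minimal sketch using the Python Beam/Dataflow SDK. The transform classes, the subscription, and the table names are placeholders, not our real code:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# Placeholder transforms standing in for our real joins/aggregations.
class ComputeCommonMetrics(beam.PTransform):
    def expand(self, events):
        return events | "NoopCommon" >> beam.Map(lambda e: e)


class ComputeModelA(beam.PTransform):
    def expand(self, metrics):
        return metrics | "NoopA" >> beam.Map(lambda m: m)


class ComputeModelB(beam.PTransform):
    def expand(self, metrics):
        return metrics | "NoopB" >> beam.Map(lambda m: m)


def run():
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        events = (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/raw-events")
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8"))))

        # Common metrics are computed once; the resulting PCollection
        # fans out to both model branches inside the same job.
        common = events | "CommonMetrics" >> ComputeCommonMetrics()

        (common
         | "ModelA" >> ComputeModelA()
         | "WriteModelA" >> beam.io.WriteToBigQuery("my_project:metrics.model_a"))

        (common
         | "ModelB" >> ComputeModelB()
         | "WriteModelB" >> beam.io.WriteToBigQuery("my_project:metrics.model_b"))


if __name__ == "__main__":
    run()
```

The fan-out from `common` is what avoids recomputing the shared joins, but it also means a failure in either branch takes down the whole job.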
Option 2: Multiple jobs piped with Pub/Sub
Extract the common-metrics computation into a dedicated job, resulting in three jobs wired together with Pub/Sub (a rough sketch follows the pros/cons below).
Pros
- More resilient if one of the model jobs fails.
- Probably easier to perform ongoing updates.
Cons
- All jobs need to be running for the full pipeline to work, which introduces dependency management.
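
And a minimal sketch of option 2, again with placeholder transform classes, topics, subscriptions, and table names. The common-metrics job republishes to an intermediate topic, and each model job reads from its own subscription on that topic (only one model job is shown; the other would mirror it):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


# Placeholder transforms standing in for our real joins/aggregations.
class ComputeCommonMetrics(beam.PTransform):
    def expand(self, events):
        return events | beam.Map(lambda e: e)


class ComputeModelA(beam.PTransform):
    def expand(self, metrics):
        return metrics | beam.Map(lambda m: m)


def run_common_metrics_job():
    """Job 1 of 3: compute the shared metrics, republish to an intermediate topic."""
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/raw-events")
         | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "CommonMetrics" >> ComputeCommonMetrics()
         | "Encode" >> beam.Map(lambda metric: json.dumps(metric).encode("utf-8"))
         | "PublishMetrics" >> beam.io.WriteToPubSub(
             topic="projects/my-project/topics/common-metrics"))


def run_model_a_job():
    """Job 2 of 3: one model job; job 3 of 3 mirrors this for the other model."""
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (p
         | "ReadMetrics" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/common-metrics-model-a")
         | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "ModelA" >> ComputeModelA()
         | "WriteModelA" >> beam.io.WriteToBigQuery("my_project:metrics.model_a"))
```

Since each model job has its own subscription on the common-metrics topic, one model job can be restarted or updated without touching the other, which is the resilience/update benefit listed above.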