What I'm doing: I'm building a system in which one Cloud Pub/Sub topic will be read by dozens of Apache Beam pipelines in streaming mode. Each time I deploy a new pipeline, it should first process several years of historic data (stored in BigQuery).
The problem: If I replay historic data into the topic whenever I deploy a new pipeline (as suggested here), it will also be delivered to every other pipeline currently reading the topic, which would be wasteful and very costly. I can't use Cloud Pub/Sub Seek (as suggested here), as it retains at most 7 days of history (more details here).
The question: What is the recommended pattern to replay historic data into new Apache Beam streaming pipelines with minimal overhead (and without causing event time/watermark issues)?
Current ideas: I can currently think of three approaches to solving the problem; however, none of them seems very elegant, and I have not seen any of them mentioned in the documentation, the common patterns (part 1 or part 2), or elsewhere. They are:
1. Ideally, I could use `Flatten` to merge the real-time `ReadFromPubSub` with a one-off `BigQuerySource`. However, I see three potential issues: a) I can't account for data that has already been published to Pub/Sub but hasn't yet made it into BigQuery, b) I am not sure whether the `BigQuerySource` might inadvertently be rerun if the pipeline is restarted, and c) I am unsure whether `BigQuerySource` works in streaming mode (per the table here). (See the first sketch below.)
2. I create a separate replay topic for each pipeline and then use `Flatten` to merge the `ReadFromPubSub`s for the main topic and the pipeline-specific replay topic. After deploying the pipeline, I replay historic data to the pipeline-specific replay topic. (See the second sketch below.)
3. I create dedicated topics for each pipeline and deploy a separate pipeline that reads the main topic and broadcasts messages to the pipeline-specific topics. Whenever a replay is needed, I can replay data into the pipeline-specific topic. (See the third sketch below.)
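To make the first idea concrete, here is a minimal sketch assuming the Beam Python SDK, using the newer `ReadFromBigQuery` transform in place of the deprecated `BigQuerySource`; all project, topic, and table names are placeholders:

```python
import json

import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.io.gcp.bigquery import ReadFromBigQuery
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder names -- substitute your own project/topic/table.
TOPIC = "projects/my-project/topics/main-topic"
TABLE = "my-project:my_dataset.historic_events"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Real-time branch: Pub/Sub payloads arrive as bytes, so parse them
    # into the same dict shape that BigQuery rows have.
    live = (
        p
        | "ReadLive" >> ReadFromPubSub(topic=TOPIC)
        | "ParseLive" >> beam.Map(json.loads)
    )
    # One-off historic branch: a bounded read emitting one dict per row.
    # Whether this runs in streaming mode depends on the runner (issue c).
    historic = p | "ReadHistoric" >> ReadFromBigQuery(table=TABLE)
    # Flatten requires both branches to carry the same element type.
    merged = (live, historic) | "Merge" >> beam.Flatten()
    # ... downstream transforms consume `merged` ...
```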
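The second idea would look roughly like this; `event_ts` is a hypothetical message attribute, but stamping replayed messages with their original event times and reading them via `timestamp_attribute` is what keeps the replay from skewing the watermark:

```python
import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder names -- substitute your own topics.
MAIN_TOPIC = "projects/my-project/topics/main-topic"
REPLAY_TOPIC = "projects/my-project/topics/replay-for-this-pipeline"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # "event_ts" is a hypothetical attribute carrying the original event
    # time; without it, replayed messages would be timestamped with their
    # (recent) publish time and distort the watermark.
    main = p | "ReadMain" >> ReadFromPubSub(
        topic=MAIN_TOPIC, timestamp_attribute="event_ts")
    replay = p | "ReadReplay" >> ReadFromPubSub(
        topic=REPLAY_TOPIC, timestamp_attribute="event_ts")
    merged = (main, replay) | "Merge" >> beam.Flatten()
    # After deployment, publish the historic data once to REPLAY_TOPIC;
    # no other pipeline subscribes to it, so nothing is delivered twice.
```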
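And a sketch of the third idea, a broadcaster pipeline that fans the main topic out to pipeline-specific topics (topic names again placeholders):

```python
import apache_beam as beam
from apache_beam.io import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder names -- substitute your own topics.
MAIN_TOPIC = "projects/my-project/topics/main-topic"
PIPELINE_TOPICS = [
    "projects/my-project/topics/pipeline-a",
    "projects/my-project/topics/pipeline-b",
    # ... one dedicated topic per downstream pipeline ...
]

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Read the raw bytes once from the shared main topic.
    messages = p | "ReadMain" >> ReadFromPubSub(topic=MAIN_TOPIC)
    # Broadcast the same PCollection to every pipeline-specific topic.
    for i, topic in enumerate(PIPELINE_TOPICS):
        messages | f"WriteTo{i}" >> WriteToPubSub(topic=topic)
```

A replay then only publishes into the single pipeline-specific topic that needs it, so the other pipelines never see the historic data.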