
What I'm doing: I'm building a system in which one Cloud Pub/Sub topic will be read by dozens of Apache Beam pipelines in streaming mode. Each time I deploy a new pipeline, it should first process several years of historic data (stored in BigQuery).

The problem: If I replay historic data into the topic whenever I deploy a new pipeline (as suggested here), it will also be delivered to every other pipeline currently reading the topic, which would be wasteful and very costly. I can't use Cloud Pub/Sub Seek (as suggested here) as it stores a maximum of 7 days of history (more details here).

The question: What is the recommended pattern to replay historic data into new Apache Beam streaming pipelines with minimal overhead (and without causing event time/watermark issues)?

Current ideas: I can think of three approaches to solving the problem; however, none of them seems very elegant, and I have not seen any of them mentioned in the documentation, the common patterns (part 1 or part 2), or elsewhere. They are:

  1. Ideally, I could use Flatten to merge the real-time ReadFromPubSub with a one-off BigQuerySource; however, I see three potential issues: a) I can't account for data that has already been published to Pub/Sub but hasn't yet made it into BigQuery, b) I am not sure whether the BigQuerySource might inadvertently be rerun if the pipeline is restarted, and c) I am unsure whether BigQuerySource works in streaming mode at all (per the table here).

  2. I create a separate replay topic for each pipeline and then use Flatten to merge the ReadFromPubSubs for the main topic and the pipeline-specific replay topic. After deploying the pipeline, I replay the historic data into the pipeline-specific replay topic (see the sketch after this list).

  3. I create a dedicated topic for each pipeline and deploy a separate pipeline that reads the main topic and broadcasts messages to the pipeline-specific topics. Whenever a replay is needed, I can replay data into the pipeline-specific topic.
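For illustration, here is a minimal sketch of what idea 2 could look like with the Beam Python SDK. The project/topic names, the parse_event helper, and the event_ts timestamp attribute are assumptions made for this example, not something I already have:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(message: bytes) -> dict:
    """Decode a Pub/Sub payload; assumes JSON-encoded events."""
    return json.loads(message.decode("utf-8"))


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Real-time events from the shared main topic.
    live = (
        p
        | "ReadLive" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/main-topic",
            timestamp_attribute="event_ts")  # use the event time, not publish time
        | "ParseLive" >> beam.Map(parse_event))

    # Historic events replayed into this pipeline's own replay topic.
    replay = (
        p
        | "ReadReplay" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/replay-topic-pipeline-x",
            timestamp_attribute="event_ts")
        | "ParseReplay" >> beam.Map(parse_event))

    # Downstream transforms see one unbounded PCollection.
    merged = (live, replay) | "MergeStreams" >> beam.Flatten()
```

Reading both topics with the same timestamp_attribute keeps replayed and live events on the same event-time axis, so the windowing downstream of the Flatten treats them uniformly.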

1 Answer


Out of your three ideas:

  • The first one will not work because the Python SDK does not currently support unbounded reads from bounded sources (meaning that you can't add a ReadFromBigQuery to a streaming pipeline).

  • The third one sounds overly complicated, and maybe costly.

I believe your best bet at the moment is, as you rightly pointed out, to replay your table into an extra Pub/Sub topic that you Flatten with your main topic.

I will check if there's a better solution, but for now, option #2 should do the trick.
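For completeness, the replay step for option #2 could be as simple as the rough sketch below. The table name, topic name, and event_ts column are hypothetical, and for several years of history you would want something more parallel, but the shape of the replay is the same:

```python
import json

from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "replay-topic-pipeline-x")

# Pull the historic rows out of BigQuery and republish them.
rows = bq.query("SELECT * FROM `my-project.events.history`").result()

futures = []
for row in rows:
    payload = json.dumps(dict(row), default=str).encode("utf-8")
    # Carry the original event time as a string attribute so the pipeline's
    # timestamp_attribute setting can restore event-time semantics on replay.
    futures.append(
        publisher.publish(topic_path, payload,
                          event_ts=row["event_ts"].isoformat()))

for future in futures:
    future.result()  # block until every message has been accepted
```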


Also, I'd refer you to an interesting talk from Lyft on how they do this in their (Flink-based) architecture.