
I have n sources that a job depends on. Each source has its own topic in Google Pub/Sub; when a source is updated, it publishes a message to the corresponding topic. When all sources have been updated (i.e. when there is at least one new message in each subscription) the job can start. The job is scheduled with Airflow. The DAG starts with a series of parallel tasks, one per subscription, each of which checks whether a new message has been published, without acknowledging it. The next task waits for all the previous ones and uses XCom to see whether each of them found a message. If so, it proceeds with the job (acknowledging the messages first); otherwise it stops. This way I acknowledge the messages only when they are all available, effectively using Pub/Sub as a coordinator. Messages arrive once or twice a day at most.
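The gating step can be sketched as a plain Python function (a minimal sketch; the `pulled` mapping and its subscription names are assumptions standing in for the XCom values returned by the per-subscription pull tasks):

```python
def all_sources_updated(pulled):
    """Return True only if every subscription yielded at least one message.

    `pulled` maps subscription name -> list of pulled (but not yet
    acknowledged) messages, as collected from the upstream tasks via XCom.
    """
    return bool(pulled) and all(len(msgs) > 0 for msgs in pulled.values())


# The downstream task proceeds (and only then acknowledges) when the gate passes:
print(all_sources_updated({"sub-a": ["msg1"], "sub-b": ["msg2"]}))  # all updated
print(all_sources_updated({"sub-a": ["msg1"], "sub-b": []}))        # one missing
```

In a real DAG this would live in a `ShortCircuitOperator`-style task so downstream tasks are skipped when the condition is false.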

Basically I'm using Pub/Sub as a way to keep "state". Suppose I have different jobs that depend on the same source. I can create a subscription to the same topic for each job and it all works fine.

Is there a better way/tool/framework to do this?


1 Answer


Given the volume of messages you have, and based on my previous implementations, I can recommend persisting the state in Firestore: it's serverless, affordable, and fast.

When a message is published, trigger a function that persists the state in Firestore.

Then trigger as many processes as you want; each one queries Firestore to check whether all the states are OK, and continues or stops accordingly.
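Here is a minimal sketch of that pattern, with an in-memory dict standing in for the Firestore collection (the collection layout, field name `updated`, and function names are assumptions, not Firestore API calls):

```python
# Stand-in for a Firestore collection, e.g. "source_states": one document
# per source, each holding an {"updated": bool} flag. In the real pattern,
# the function triggered by each Pub/Sub message performs the write.
source_states = {}


def mark_updated(source_id):
    """What the per-topic trigger function would persist on each message."""
    source_states[source_id] = {"updated": True}


def all_states_ok(required_sources):
    """What each downstream process queries before deciding to run."""
    return all(source_states.get(s, {}).get("updated") for s in required_sources)


def reset(required_sources):
    """Clear the flags once a run has started, so the next cycle starts fresh."""
    for s in required_sources:
        source_states[s] = {"updated": False}
```

With the real Firestore client the reads and writes would target documents in a collection instead of dict entries, but the control flow (write on trigger, check-all before running, reset after) is the same.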

That's my pattern for synchronization. Not necessarily the best!


Anyway, if you create a subscription per process, that also works. The messages are duplicated in each subscription, so you can process them independently.