4 votes

What is the best way to ensure idempotence when using Cloud Dataflow and Pub/Sub?

We currently have a system that processes and stores records in a MySQL database. I'd like to use Dataflow for some of our reporting, but I want to understand what I would need to do to ensure that I don't accidentally count the same messages twice (or more).

My confusion comes in two parts: first, ensuring I send each message only once, and second, ensuring I process it only once.

My gut instinct would be as follows:

Whenever an event I'm interested in is recorded in our MySQL database, transform it into a Pub/Sub message and publish it to Pub/Sub. Assuming success, record the Pub/Sub message ID that's returned alongside the MySQL record. That way, if the record has a Pub/Sub ID, I know I've sent it and I don't need to send it again. If the publish to Pub/Sub fails, then I know I need to send it again. All good.
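For illustration, here is a rough sketch of that publish step using the Cloud Pub/Sub Java client; the project/topic names and the method name are placeholders. The returned message ID is what would be stored back alongside the MySQL row:

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class EventPublisher {
    // Publishes one event and returns the Pub/Sub-assigned message ID, which
    // the caller would store alongside the MySQL row as an "already sent" marker.
    static String publishEvent(String payloadJson) throws Exception {
        // Placeholder project and topic names.
        Publisher publisher =
                Publisher.newBuilder(TopicName.of("my-project", "events-topic")).build();
        try {
            PubsubMessage message = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8(payloadJson))
                    .build();
            ApiFuture<String> future = publisher.publish(message);
            return future.get(); // blocks until Pub/Sub acknowledges the publish
        } finally {
            publisher.shutdown();
        }
    }
}
```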

But if the write to MySQL fails after the Pub/Sub publish succeeds, I might end up publishing the same message to Pub/Sub again, so I need something on the Dataflow side to handle both this case and the case where Pub/Sub delivers a message twice (as per https://cloud.google.com/pubsub/subscriber#guarantees).

What's the best way to handle this? In App Engine or other systems I would check the datastore to see whether the record I'm creating already exists, but I'm not sure how you'd do that with Dataflow. Is there a way I can easily implement a filter to stop a message from being processed twice? Or does Dataflow handle this already?


1 Answer

6 votes

Dataflow can de-duplicate messages based on an arbitrary message attribute (selected by idLabel) on the receiver side, as outlined in Using Record IDs. On the producer side, you'll want to make sure that you deterministically and uniquely populate that attribute based on the MySQL record. If this is done correctly, Dataflow will process each logical record exactly once.
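As a minimal sketch of the receiver side (the subscription path and the recordId attribute name are placeholders), using the Apache Beam Java SDK, where withIdAttribute plays the role of idLabel in the older Dataflow SDK. The assumption is that your publisher sets a recordId attribute derived from the MySQL primary key on every message, so retried publishes carry the same value and are treated as duplicates:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DedupedReportingPipeline {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Messages arriving with the same "recordId" attribute value are treated
        // as duplicates and delivered to the rest of the pipeline only once.
        PCollection<String> events = pipeline.apply("ReadEvents",
                PubsubIO.readStrings()
                        .fromSubscription("projects/my-project/subscriptions/events-sub") // placeholder
                        .withIdAttribute("recordId"));

        // ... downstream reporting transforms and the write to your sink go here ...

        pipeline.run();
    }
}
```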