How do I best ensure idempotence when using Cloud Dataflow and Pub/Sub?
We currently have a system that processes and stores records in a MySQL database. I'd like to use Dataflow for some of our reporting, but I want to understand what I'd need to do to make sure I don't accidentally count the same message twice (or more than twice).
My confusion comes in two parts: first, ensuring I send each message only once, and second, ensuring I process each message only once.
My gut instinct is as follows:
Whenever an event I'm interested in is recorded in our MySQL database, transform it into a Pub/Sub message and publish it. Assuming the publish succeeds, record the returned Pub/Sub message ID alongside the MySQL record. That way, if a record has a Pub/Sub ID, I know I've already sent it and don't need to send it again; if the publish fails, I know I need to retry. All good.
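In Python with the google-cloud-pubsub client, I picture the publishing side looking roughly like the sketch below. The `events` table, its `pubsub_message_id` column, the `event_id` attribute, and the project/topic names are all placeholders of mine, and `conn` is any DB-API MySQL connection (e.g. PyMySQL):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Project and topic names are placeholders.
topic_path = publisher.topic_path("my-project", "my-events-topic")

def publish_event(conn, event_id, payload):
    """Publish one MySQL event (payload: bytes) to Pub/Sub, then record the message ID."""
    # Attach the MySQL primary key as an attribute so a consumer can
    # deduplicate even if this row ends up being published twice.
    future = publisher.publish(topic_path, payload, event_id=str(event_id))
    message_id = future.result()  # blocks until Pub/Sub accepts the publish

    # If this UPDATE fails after a successful publish, the row keeps a NULL
    # pubsub_message_id, a retry re-publishes the event, and we get a
    # duplicate (the failure case described below).
    with conn.cursor() as cursor:
        cursor.execute(
            "UPDATE events SET pubsub_message_id = %s WHERE id = %s",
            (message_id, event_id),
        )
    conn.commit()
```

Attaching the MySQL primary key as a message attribute is deliberate: it gives the consumer a stable key to deduplicate on, which matters for the failure case below.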
But if the write to MySQL fails after the Pub/Sub publish succeeds, I might end up publishing the same message to Pub/Sub again. So I need something on the Dataflow side to handle both this case and the case where Pub/Sub itself delivers a message more than once, since delivery is only guaranteed at-least-once (as per https://cloud.google.com/pubsub/subscriber#guarantees).
What's the best way to handle this? On App Engine or other systems I would check the datastore to see whether the record I'm about to create already exists, but I'm not sure how you'd do that with Dataflow. Is there an easy way to implement a filter that stops a message from being processed twice? Or does Dataflow handle this already?
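For what it's worth, the Beam/Dataflow docs mention an `id_label` option on the Pub/Sub source that sounds like it deduplicates on a message attribute. Here's a sketch of how I'd imagine using it with the `event_id` attribute from the publisher above (project and subscription names are placeholders, and I gather the deduplication is only honored by the Dataflow runner, not the direct runner):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub",  # placeholder
            with_attributes=True,
            id_label="event_id",  # dedupe on the stable MySQL primary key
        )
        | "Process" >> beam.Map(lambda msg: msg.data)  # placeholder processing
    )
```

Is that the intended mechanism, or is there a more idiomatic approach?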