Pub/Sub to Datastore Batch Dataflow Job in Apache Beam Python SDK Possible?

Question

I have a Pub/Sub topic which will periodically (usually once every few days or weeks, but sometimes more often) receive batches of messages. I'd like to kick off a batch Dataflow job to read these messages, perform some transformations, write the result to Datastore, then stop running. When a new batch of messages goes out, I want to kick off a new job. I've read over the Apache Beam Python SDK docs and many SO questions, but am still uncertain about a few things.

Can Pub/Sub IO read as part of a non-streaming job? Can the same job then write using Datastore IO (which doesn't currently support streaming)? Can I assume the default global window and trigger will correctly tell the job when to stop reading from Pub/Sub (when the batch of messages is no longer being written)? Or do I need to add some sort of trigger/windowing scheme like max time or max number of elements? Would that trigger, when triggered, tell the global window to close and therefore end the job?

Daniel Collins Daniel Collins · Accepted Answer · 2019-04-22T23:00:21

Edit: Incorrectly answered assuming this was for Java Beam with Dataflow.

Apologies, I missed that this was for Python.

Per the documentation here added in this pull request, Datastore is explicitly not supported in streaming mode in Python. There is an inconsistency in the documentation, where it claims that Python batch mode for Pub/Sub is supported, whereas the linked code says that it is only supported in streaming pipelines. I have filed a Jira bug to try to get this resolved.

This appears to not be a currently supported use case for Dataflow in python streaming mode. I would suggest that you consider using the Java version of Apache Beam instead, which supports streaming writes into Datastore.

Pub/Sub to Datastore Batch Dataflow Job in Apache Beam Python SDK Possible?

1 Answers