3 votes

I have written a streaming pipeline using the Google Cloud Dataflow SDK, but I want to test it locally. The pipeline takes its input data from Google Cloud Pub/Sub.

Is it possible to run jobs that access Pub/Sub (PubsubIO) using the DirectPipelineRunner (local execution, not in Google Cloud)?

I am running into permission issues while logged in as my normal user account, even though I am the owner of the project that contains the Pub/Sub topic I am trying to access.

By "trying to access", what did you do exactly? Every operation should just work if you're an owner of the project. – Takashi Matsuo

4 Answers

3 votes

The InProcessPipelineRunner is a new version of the DirectPipelineRunner, introduced in Dataflow SDK for Java 1.6.0, that includes support for unbounded PCollections.

(Note: In Apache Beam, this functionality has already been added to the DirectRunner, but in the Dataflow SDK for Java, we can't do that until 2.0 since its better checking of the model may cause additional test failures, which we consider a backwards incompatible change. Hence the addition of the companion InProcessPipelineRunner for the time being.)

There's also some great new support for testing late and out-of-order data.

2 votes

PubsubIO is not currently supported in the DirectPipelineRunner. When used locally, you will get an error stating that there is "no evaluator registered for PubsubIO.Read".

It is likely that your permission issues are coming from some other source.

0 votes

Just to help anyone who searches for this:

With the latest version this is possible. Use "DirectRunner" to run the pipeline locally and "DataflowRunner" to run it in Google Cloud.

Set the staging location and the runner as shown below:

streamingOption.setStagingLocation(PipelineConstants.PUBSUB_STAGING_LOCATION);
streamingOption.setRunner(DataflowRunner.class);

Or pass them as command-line arguments, as in the sketch below.

Could you elaborate a bit more on the permission issue that you faced?

-1 votes

It actually is possible, but the DirectPipelineRunner doesn't support unbounded data sources, so you'll have to set a maxReadTime or maxNumRecords, like so:

PubsubIO.Read.topic("projects/<project-id>/topics/<topic>").maxNumRecords(1000);

From the PubsubIO documentation:

A PTransform that continuously reads from a Cloud Pub/Sub stream and returns a PCollection of Strings containing the items from the stream. When running with a PipelineRunner that only supports bounded PCollections (such as DirectPipelineRunner), only a bounded portion of the input Pub/Sub stream can be processed. As such, either PubsubIO.Read.Bound.maxNumRecords(int) or PubsubIO.Read.Bound.maxReadTime(Duration) must be set.