0 votes

I'm trying to run a pipeline on Google Cloud Dataflow in streaming mode. The pipeline should read from a Pub/Sub topic, but it doesn't actually read anything from the topic unless I delete the topic, re-create it, and re-publish all the messages AFTER the pipeline has started.

Is there any way to make the pipeline read already-published messages?


2 Answers

1 vote

It sounds like supplying a Pub/Sub subscription (more details in the Pub/Sub I/O documentation) would solve your problem. Messages published after the subscription is created are buffered, so they can be read when the pipeline starts.
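
If you'd rather create that subscription programmatically instead of through the Cloud Console, a minimal sketch using the google-cloud-pubsub admin client might look like the following (the project, topic, and subscription names are placeholders):

    import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
    import com.google.pubsub.v1.ProjectSubscriptionName;
    import com.google.pubsub.v1.ProjectTopicName;
    import com.google.pubsub.v1.PushConfig;

    try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
        // Create a pull subscription on the topic; anything published to the
        // topic from this point on is buffered until the pipeline reads it.
        client.createSubscription(
            ProjectSubscriptionName.of("my-project", "my-subscription"), // placeholder names
            ProjectTopicName.of("my-project", "my-topic"),
            PushConfig.getDefaultInstance(), // empty push config = pull subscription
            60); // ack deadline in seconds
    }

Note that you must create the subscription before publishing: Pub/Sub does not deliver messages that were published before the subscription existed.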

1 vote

Create a custom subscription on the topic in Pub/Sub (for example, through the Cloud Console). Then, in the code, try something like this:

    // projectId and subscriptionName are placeholders for your own values.
    PCollection<TableRow> datastream = p.apply(PubsubIO.Read.named("Read device IoT data from PubSub")
        // Read from the pre-created subscription rather than the topic itself.
        .subscription(String.format("projects/%s/subscriptions/%s", projectId, subscriptionName))
        // Use each message's "ts" attribute as its event timestamp.
        .timestampLabel("ts")
        .withCoder(TableRowJsonCoder.of()));

Note that PubsubIO lets you read from either a topic or a subscription.

In the code above I read from a subscription that I created explicitly in the Pub/Sub console. The advantage of an explicit subscription is that Pub/Sub retains messages published while your Dataflow pipeline is offline, so no data is lost.
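
If you are on the newer Apache Beam 2.x SDK (which current Dataflow jobs use), the equivalent read from an explicit subscription might look like this sketch; the subscription path and step name are illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    PCollection<String> messages = p.apply("ReadDeviceIotData",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription")
            .withTimestampAttribute("ts")); // use the "ts" attribute as event time

The behavior is the same: messages published while the pipeline is down accumulate in the subscription and are delivered once the pipeline comes back up.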