0 votes

We are running our program inside a Kubernetes pod that listens for Pub/Sub messages. Based on the message data type it launches a Dataflow job, and once the job finishes we send a Pub/Sub message to another system.

The pipeline is launched in batch mode; it reads from GCS and, after processing, writes back to GCS.

Pipeline pipeline = Pipeline.create(options);
PCollection<String> read = pipeline
                .apply("Read from GCS",
                        TextIO.read().from("GCS_PATH").withCompression(Compression.GZIP));

//process 
// write to GCS
....
PipelineResult result = pipeline.run();
result.waitUntilFinish();

// send job-completed message via Pub/Sub to the other component
....
....

I have to send events to other components in the system. As of now I am using the Pub/Sub Java client library to publish the message.
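For context, that publish is roughly a sketch along these lines (my-project and job-events are placeholder names; the call runs after waitUntilFinish() returns):

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

// Placeholder project and topic names.
TopicName topic = TopicName.of("my-project", "job-events");
Publisher publisher = Publisher.newBuilder(topic).build();
try {
    PubsubMessage message = PubsubMessage.newBuilder()
            .setData(ByteString.copyFromUtf8("job-completed"))
            .build();
    // Block until Pub/Sub acknowledges the publish.
    publisher.publish(message).get();
} finally {
    publisher.shutdown();
}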

Is there a way I can use the Apache Beam Pub/Sub connector to send the message from within the pipeline, like below? Or what is the right way to do this?

PubsubIO.writeMessages().to("topicName");
Why are you writing the output to GCS? What is the issue with writing the output to Pub/Sub directly? – Jayadeep Jayaraman
We write our processed data to a data lake, and once the batch job finishes we notify another component for further analysis via Pub/Sub. So Pub/Sub is only used for sending start and end notifications about the Dataflow job. – Aditya

1 Answer

1 vote

To solve this use case you can use Beam's Wait transform (org.apache.beam.sdk.transforms.Wait). The pattern from its documentation:

 PCollection<Void> firstWriteResults = data.apply(ParDo.of(...write to first database...));
 data.apply(Wait.on(firstWriteResults))
     // Windows of this intermediate PCollection will be processed no earlier than when
     // the respective window of firstWriteResults closes.
     .apply(ParDo.of(...write to second database...));
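
Adapted to your batch pipeline, a minimal sketch could look like this. The bucket paths and topic name are placeholders, and the processing step is elided; it assumes TextIO.write().withOutputFilenames(), whose output-filenames PCollection serves as the signal for Wait.on, and PubsubIO's bounded-write support for batch pipelines:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

Pipeline pipeline = Pipeline.create(options);

PCollection<String> processed = pipeline
        .apply("Read from GCS",
                TextIO.read().from("gs://my-bucket/input/*.gz")
                        .withCompression(Compression.GZIP));
// (processing steps between read and write elided)

// withOutputFilenames() makes the write return a result whose written
// filenames are exposed as a PCollection, usable as a Wait signal.
WriteFilesResult<Void> written = processed.apply("Write to GCS",
        TextIO.write().to("gs://my-bucket/output/part").withOutputFilenames());

pipeline
        .apply("Completion event", Create.of("job-completed"))
        // The single event is held back until the GCS write has finished.
        .apply(Wait.on(written.getPerDestinationOutputFilenames()))
        .apply("Notify downstream",
                PubsubIO.writeStrings().to("projects/my-project/topics/my-topic"));

pipeline.run().waitUntilFinish();

This way the completion message is published from within the job itself, only after the write has completed, and the separate client-library publish after waitUntilFinish() is no longer needed.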