
In our application we read data from Pub/Sub, using PubsubIO, in a Dataflow pipeline. Below is the code.

    PCollection<String> pubsubMsg = pipeline
            .apply(PubsubIO.readStrings().fromSubscription(options.getInputSubscription1()));

    PCollection<String> groupByBigqueryResult = pubsubMsg.apply("Read from bigquery table",
            ParDo.of(new ReadRawdataFromBiqueryTable()));

But when we attach the BigQuery read to this pipeline, the Pub/Sub message flow rate becomes very slow. Since the BigQuery read is done inside a ParDo and is slow, it seems some default flow control settings are implemented in the Pub/Sub subscriber. If we comment out the BigQuery read implemented in ReadRawdataFromBiqueryTable, the flow rate is fast again. How do we override the flow control settings? Screenshots of both Dataflow jobs are attached: 1. with the BigQuery read (Pub/Sub flow rate is slow due to the slow BigQuery read); 2. with the BigQuery read commented out.


1 Answer


I suspect what is happening here is that the pipeline is slow due to the slow ParDo, not explicit flow control from Dataflow or PubsubIO. Dataflow (and Beam in general) reads data into a pipeline, and each data element is passed through that pipeline (sometimes buffered). So in this case, PubsubIO will not read more data if the immediately following step (the ParDo that reads from BigQuery) is slow. I suggest reading the following to learn more about the Beam programming model.

https://beam.apache.org/documentation/programming-guide/

One way to speed this up might be to reduce the amount of data read from BigQuery: buffer multiple elements and issue fewer, larger requests to BigQuery (or re-structure the pipeline in some other way).
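The batching idea can be sketched in plain Java (outside Beam, so it runs standalone). In a real pipeline a transform like Beam's GroupIntoBatches would play this role; `BatchDemo` and `batch` here are hypothetical names for illustration only, not part of the poster's code:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDemo {
    // Collect elements into fixed-size batches so a downstream lookup can
    // issue one BigQuery request per batch instead of one per element.
    static List<List<String>> batch(List<String> elements, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < elements.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                    elements.subList(i, Math.min(i + batchSize, elements.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> msgs = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            msgs.add("msg-" + i);
        }
        // 25 messages in batches of 10 -> 3 requests instead of 25.
        List<List<String>> batches = batch(msgs, 10);
        System.out.println(batches.size());        // prints 3
        System.out.println(batches.get(2).size()); // prints 5
    }
}
```

With, say, batches of 100, ten thousand Pub/Sub messages would translate into 100 BigQuery requests rather than 10,000, which is the kind of reduction that can make the ParDo stop acting as the bottleneck.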