
For our near-real-time analytics, data is streamed into Pub/Sub and processed by an Apache Beam Dataflow pipeline, which first writes the raw events into BigQuery, then does the aggregate processing by reading them back from BigQuery, and finally stores the aggregated results in HBase for OLAP cube computation.

Here is the sample ParDo function which is used to fetch a record from BigQuery:

// Runs a synchronous count query against BigQuery from inside the DoFn.
String eventInsertedQuery = "SELECT count(*) AS usercount FROM <tablename> WHERE <condition>";
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig =
    QueryJobConfiguration.newBuilder(eventInsertedQuery).build();
// query() blocks until the query job completes.
TableResult result = bigquery.query(queryConfig);
FieldValueList row = result.getValues().iterator().next();
LOG.info("rowCount {}", row.get("usercount").getStringValue());

bigquery.query is taking around ~4 seconds per call. Any suggestions to improve it? Since this is near-real-time analytics, this duration is not acceptable.

It sounds like you are using a custom ParDo to make an RPC to BigQuery and pull in data, rather than using the built-in BigQueryIO.Read. Is there a reason why you are doing that? With that approach it will take a few seconds to make the RPC, and if you do it in the process function it is called for every element, which will make the entire pipeline very slow. Is it possible to use BigQueryIO.Read instead, which is optimized to pull in rows in batch and then parallelize the processing in the pipeline? - Alex Amato
You can either read in the whole table or provide a custom query to BigQueryIO.Read, then perform the computation and aggregation in the Dataflow pipeline based on the elements that are output from BigQueryIO.Read: beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/… - Alex Amato
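A minimal sketch of what that could look like with the Beam Java SDK, keeping the placeholder table and condition from the question (the query then runs once when the source is read, not once per element; `options` is assumed to be your PipelineOptions defined elsewhere):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create(options);

// The query is executed once by the source; its rows are then available
// as a PCollection that Dataflow can process in parallel.
PCollection<TableRow> userCounts = p.apply("ReadUserCount",
    BigQueryIO.readTableRows()
        .fromQuery("SELECT count(*) AS usercount FROM `<tablename>` WHERE <condition>")
        .usingStandardSql());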
I recommend creating a pipeline where you read from two sources, PubSub and BigQueryIO.Read. Then apply a Window to time-bound the data (you can't join unbounded/infinite streaming data) and perform a CoGroupByKey to join the PubSub elements with the BigQuery rows using a key. After joining the data and computing what you want, output that to a sink, perhaps using BigQueryIO.Write. See Windowing, CoGroupByKey, etc.: beam.apache.org/documentation/programming-guide - Alex Amato
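A rough outline of that windowing/CoGroupByKey pattern, reusing the `p` pipeline and `userCounts` collection from the previous snippet and assuming hypothetical userId/usercount columns, placeholder topic names and a hypothetical extractUserId parser; it is a sketch of the shape of the join, not a drop-in implementation:

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Key and window both inputs the same way. Note: the bounded BigQuery rows
// need event timestamps assigned (e.g. with WithTimestamps) so they land in
// the same windows as the streaming Pub/Sub elements.
PCollection<KV<String, TableRow>> countsByUser = userCounts
    .apply(WithKeys.of((TableRow r) -> (String) r.get("userId"))
        .withKeyType(TypeDescriptors.strings()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));

PCollection<KV<String, String>> eventsByUser = p
    .apply(PubsubIO.readStrings().fromTopic("projects/<project>/topics/<topic>"))
    .apply(WithKeys.of((String msg) -> extractUserId(msg))  // hypothetical parser
        .withKeyType(TypeDescriptors.strings()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));

final TupleTag<TableRow> countsTag = new TupleTag<>();
final TupleTag<String> eventsTag = new TupleTag<>();

// One CoGbkResult per key per window, holding the matching rows and events.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(countsTag, countsByUser)
        .and(eventsTag, eventsByUser)
        .apply(CoGroupByKey.create());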
@AlexAmato - I think the OP is ingesting data from Pub/Sub and writing it to BQ in raw format in step 1, then in step 2 reading from the BQ table, performing some aggregates and writing the results into Bigtable. Based on this understanding, they can read from PubSub, write the raw data to BQ, and then perform windowing and aggregation on the same data and write it into Bigtable, without the need for reading from BQ. - Jayadeep Jayaraman
Alternatively, depending on how much data you are looking up for each key in the BigQuery table, you could compute a side input with the data for each key. Then, in a ParDo which receives the PubSub events as its main input and the side input (with the derived BigQuery data), you can look up the side input data efficiently from data cached in memory. See Side Inputs: beam.apache.org/documentation/programming-guide/#side-inputs and look for examples of pvalue.AsDict, pvalue.AsSingleton and pvalue.AsList. - Alex Amato
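In the Java SDK the equivalents are View.asMap, View.asSingleton and View.asList. A small sketch of the map side input variant, again assuming the `p` and `userCounts` collections from the earlier snippets, hypothetical userId/usercount columns, placeholder topic names and a hypothetical extractUserId parser:

import java.util.Map;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

// Materialize the BigQuery rows once as an in-memory map keyed by userId.
PCollectionView<Map<String, Long>> countsView = userCounts
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
        .via((TableRow r) -> KV.of((String) r.get("userId"),
                                   Long.parseLong(r.get("usercount").toString()))))
    .apply(View.asMap());

PCollection<String> events = p.apply("ReadEvents",
    PubsubIO.readStrings().fromTopic("projects/<project>/topics/<topic>"));

// Look up the cached value per element instead of issuing an RPC per element.
events.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, Long> counts = c.sideInput(countsView);
        Long count = counts.get(extractUserId(c.element()));  // hypothetical parser
        c.output(c.element() + " -> " + count);
      }
    }).withSideInputs(countsView));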

1 Answer


Frequent reads from BigQuery can add undesired latency to your app. Considering that BigQuery is a data warehouse designed for analytics, 4 seconds is actually a reasonable response time. I would suggest optimizing the query to get below that 4-second threshold.

Following is a list of possibilities you can opt for:

  1. Optimizing the query statement, including changing the database schema to add partitioning or clustering.
  2. Using a relational database provided by Cloud SQL to get better response times.
  3. Changing the architecture of your app. As recommended in the comments, it is a good option to transform the data before writing to BQ, so you can avoid the latency of querying the data twice (see the sketch below). There are several articles on performing near-real-time computation with Dataflow (e.g. building real time app and real time aggregate data).
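
A sketch of that restructured pipeline with the Beam Java SDK, assuming placeholder project/topic/table names, a hypothetical extractUserId function, an application-specific TableRow schema, and a placeholder DoFn for the Bigtable/HBase write (BigtableIO.write could be used there instead):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

Pipeline p = Pipeline.create(options);

PCollection<String> events = p.apply("ReadEvents",
    PubsubIO.readStrings().fromTopic("projects/<project>/topics/<topic>"));

// Branch 1: stream the raw events into BigQuery for ad hoc analysis.
events.apply("WriteRawToBQ",
    BigQueryIO.<String>write()
        .to("<project>:<dataset>.<tablename>")
        .withFormatFunction(json -> new TableRow().set("payload", json))  // schema is application-specific
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

// Branch 2: window and aggregate the same elements in-pipeline, so there is
// no need to read them back from BigQuery, then write to the serving store.
events
    .apply("KeyByUser", WithKeys.of((String json) -> extractUserId(json))  // hypothetical parser
        .withKeyType(TypeDescriptors.strings()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(Count.perKey())
    .apply("WriteAggregates", ParDo.of(new WriteToBigtableFn()));  // placeholder sink DoFn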

On the other hand, keep in mind that the time to finish a query is not covered by the BigQuery SLA webpage; in fact, it is expected that errors can occur and consume even more time to finish a query (see Back-off Requirements in the same link).