Change Google Cloud Dataflow BigQuery Priority

Question

I have a Beam job running on Google Cloud DataFlow that reads data from BigQuery. When I run the job it takes minutes for the job to start reading data from the (tiny) table. It turns out the dataflow job sends of a BigQuery job which runs in BATCH mode and not in INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.

Graham Polley Graham Polley · Accepted Answer · 2017-05-28T01:53:48

Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.

From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):

private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);

  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH") <-- NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");

  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}

If it's really a problem for you that the query is running in BATCH mode, then one workaround could be:

Using the BigQuery API directly, roll your own initial request, and set the priority to INTERACTIVE.
Write the results of step 1 to a temp table
In your Beam pipeline, read the temp table using BigQueryIO.Read.from()

Change Google Cloud Dataflow BigQuery Priority

2 Answers