1
votes

I have a Beam job running on Google Cloud Dataflow that reads data from BigQuery. When I run the job, it takes minutes before it starts reading data from the (tiny) table. It turns out the Dataflow job sends off a BigQuery job that runs in BATCH mode rather than INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.

2 Answers

1
votes

Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.

From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java:

private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);

  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH") // <-- hardcoded, NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");

  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}

If it's really a problem for you that the query is running in BATCH mode, then one workaround could be:

  1. Using the BigQuery API directly, issue the initial query yourself and set the priority to INTERACTIVE.
  2. Write the results of step 1 to a temp table.
  3. In your Beam pipeline, read the temp table using BigQueryIO.Read.from().
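
Steps 1 and 2 could be sketched roughly as below, using only the JDK (Java 11+) and the BigQuery REST jobs.insert endpoint. This is a minimal sketch, not the way Beam itself does it: the class name, helper names, and the project/dataset/table values are hypothetical, and obtaining the OAuth access token is assumed to happen elsewhere.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds a BigQuery jobs.insert request that runs a query
// with INTERACTIVE priority and writes the result to a temp table.
final class InteractiveQueryRequest {

    // Escapes backslashes, double quotes and newlines so the value fits in a JSON string.
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // JSON body for jobs.insert: configuration.query with priority and destination table.
    static String buildJobBody(String sql, String project, String dataset, String table) {
        return "{\"configuration\":{\"query\":{"
            + "\"query\":\"" + jsonEscape(sql) + "\","
            + "\"useLegacySql\":false,"
            + "\"priority\":\"INTERACTIVE\","
            + "\"destinationTable\":{"
            + "\"projectId\":\"" + jsonEscape(project) + "\","
            + "\"datasetId\":\"" + jsonEscape(dataset) + "\","
            + "\"tableId\":\"" + jsonEscape(table) + "\"},"
            + "\"writeDisposition\":\"WRITE_TRUNCATE\""
            + "}}}";
    }

    // POST request for the jobs.insert endpoint; accessToken is assumed to be
    // an OAuth 2.0 bearer token obtained by the caller.
    static HttpRequest buildRequest(String project, String accessToken, String body) {
        return HttpRequest.newBuilder()
            .uri(URI.create("https://bigquery.googleapis.com/bigquery/v2/projects/"
                + project + "/jobs"))
            .header("Authorization", "Bearer " + accessToken)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Placeholder names; replace with your own project, dataset and table.
        String body = buildJobBody(
            "SELECT * FROM `my-project.my_dataset.tiny_table`",
            "my-project", "my_dataset", "tmp_results");
        System.out.println(body);
    }
}
```

The request would then be sent with java.net.http.HttpClient. Since jobs.insert returns before the query finishes, the job status has to be polled until it is DONE before the pipeline reads the temp table in step 3.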

0
votes

You can configure the queries to run with "Interactive" priority by passing a priority parameter. See this GitHub example for details.

Please note that you might then hit one of the BigQuery limits and quotas. With batch priority, a query that hits a rate limit is queued and retried later. With interactive priority, a query that hits these limits fails immediately, because BigQuery assumes that an interactive query is something you need to run right away.