2 votes

I am using the App Engine Standard Environment (autoscaled), which means I have a limit of 10 minutes before a request is cancelled.

The goal is to query data from BigQuery at regular intervals and, for each record, create a task in the task queue so that the records can be processed in the background.

The instructions at https://cloud.google.com/bigquery/create-simple-app-api say to wait for a job like this:

    // (Assumes a client and a query configuration along the lines of:
    //  BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    //  QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).build();)

    // Create a job ID so that we can safely retry.
    JobId jobId = JobId.of(UUID.randomUUID().toString());
    Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());

    // Wait for the query to complete.
    queryJob = queryJob.waitFor();

The problem is the 10-minute limit: BigQuery queries are processed in the background, and it may take a while until the result becomes available, so I may not be able to process the response within the same endpoint call.

  • Is there a way to receive a callback from BigQuery at a URL when a query result is ready?
  • Is there a smarter way to process data from BigQuery in App Engine Standard?

I know I can configure App Engine to extend the maximum time per request, but that can hardly be the solution.

There's no way to get a callback from BigQuery, unfortunately. What are you doing with the results of the query/job? I.e., can you explain a little more why you need to wait for the job to finish? – Graham Polley
@GrahamPolley Our servers stream usage data into BigQuery (comparable to phone call minutes); the data is then consolidated using a query (e.g. by participant). I was planning to create a task in the task queue for each such consolidated record, so that it can be processed further. It's quite similar to Google's App Engine billing system, where data is collected from servers and consolidated, and then it is decided at roughly 5-minute intervals whether a given resource is still available or not. – Oliver Hausler
So, in essence, you need to schedule some BigQuery jobs/queries that crunch some data and write the results to a new table. Is that right? – Graham Polley
Yes, that's correct, and so far it works (I have aggregation and de-duplication coded). The problem is that the data needs to be processed further in the App Engine Standard environment. Pulling data from tables and marking records as pulled is not a good option, I guess. What I'm missing is a hook or something that calls a URL every time a record is created, so that App Engine can process the data further, record by record. – Oliver Hausler
Not quite sure what you mean by the 10-minute limit. GAE request handlers have only 60 s to return a response, otherwise a 500 is returned. Tasks, on the other hand, get 10 minutes (with automatic scaling) or up to 24 hours (basic and manual scaling). This means you can use a request handler to enqueue a task that runs your BQ queries in the background. – Willian Fuks

2 Answers

1 vote

The best option is to handle long-running tasks the way BQ itself does: provide a job ID and let clients poll it, returning a 202 while the query has not finished and a 200 with the result once it is ready to be consumed by the client.

Furthermore, the 202 response can carry a body, so you can report different statuses to the client (e.g. "Queued", "Running", "Processing results", ...).
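
For illustration, a minimal sketch of such a status endpoint as an App Engine servlet; the JobStore helper, its Entry type, and the /jobstatus URL are hypothetical stand-ins for whatever persistence and routing you choose:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Clients poll /jobstatus?jobId=... and get a 202 while the job is
    // running, then a 200 with the stored result once it has finished.
    public class JobStatusServlet extends HttpServlet {
      @Override
      protected void doGet(HttpServletRequest req, HttpServletResponse resp)
          throws IOException {
        JobStore.Entry entry = JobStore.get(req.getParameter("jobId")); // assumed helper
        if (entry == null) {
          resp.sendError(HttpServletResponse.SC_NOT_FOUND, "unknown job");
        } else if (!entry.isDone()) {
          resp.setStatus(HttpServletResponse.SC_ACCEPTED);   // 202
          resp.getWriter().println(entry.getState());        // "Queued", "Running", ...
        } else {
          resp.setStatus(HttpServletResponse.SC_OK);         // 200
          resp.getWriter().println(entry.getResult());       // stored by the cron job
        }
      }
    }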

On the server side, you start a query and, as soon as BQ returns a job ID, you store it in some persistent storage (I would choose Datastore, but it could be Memcache, a Cloud SQL instance, or even a file in GCS).
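
In Java this could look roughly like the sketch below, assuming the google-cloud-bigquery client; persistJobId is a hypothetical helper standing in for the Datastore write:

    import java.util.UUID;
    import com.google.cloud.bigquery.*;

    // Start the query but do NOT wait for it; return to the client right away.
    static String startQuery(BigQuery bigquery, String sql) {
      QueryJobConfiguration queryConfig =
          QueryJobConfiguration.newBuilder(sql).setUseLegacySql(false).build();
      JobId jobId = JobId.of(UUID.randomUUID().toString());
      bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
      persistJobId(jobId.getJob()); // hypothetical: write the job ID to Datastore
      return jobId.getJob();
    }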

Then you just need a cron job that checks BQ for the status of the unfinished queries and updates their status in your persistent storage accordingly. Once the BQ job is finished, you can retrieve the results and store them so that they are ready when the client checks your service.
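
A rough Java sketch of that cron check (JobStore is again a hypothetical persistence helper; getJob, isDone and getQueryResults are from recent versions of the google-cloud-bigquery client):

    import com.google.cloud.bigquery.*;

    // Body of a cron handler (e.g. mapped to /tasks/check-jobs in cron.xml).
    static void checkUnfinishedJobs(BigQuery bigquery) throws InterruptedException {
      for (String id : JobStore.unfinishedJobIds()) { // hypothetical helper
        Job job = bigquery.getJob(JobId.of(id));
        if (job == null || !job.isDone()) {
          continue; // still running; check again on the next cron run
        }
        if (job.getStatus().getError() == null) {
          TableResult result = job.getQueryResults(); // results are ready now
          JobStore.markDone(id, result);              // store them for the client
        } else {
          JobStore.markFailed(id, job.getStatus().getError().toString());
        }
      }
    }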

At the REST level, these are the BQ API calls you would make from your app (shown here with curl as an example; you can later translate them to any language using the idiomatic client libraries):

  1. Create the job, retrieve the job ID from the response, and store it:

    PROJECT=$(gcloud config get-value project)
    QUERY='SELECT * FROM `bigquery-samples.wikipedia_benchmark.Wiki1k` limit 0'
    curl -H"Authorization: Bearer $(gcloud auth print-access-token)" -H'content-type:application/json' https://www.googleapis.com/bigquery/v2/projects/$PROJECT/jobs -d"
    {
     \"configuration\": {
      \"query\": {
       \"query\": \"$QUERY\",
       \"useLegacySql\": false
      }
     },
     \"jobReference\": {
      \"projectId\": \"$PROJECT\"
     }
    }"|jq -r .jobReference.jobId >> running_jobs
    
  2. Keep polling the BQ API for the job status (this could be your cron job):

    for job in $(cat running_jobs); do
      if [ $(curl -H"Authorization: Bearer $(gcloud auth print-access-token)" https://www.googleapis.com/bigquery/v2/projects/$PROJECT/jobs/$job|jq -r .status.state) = "DONE" ]; then
        # here your processing part including your callback
        # then remove the job from the list of running jobs
        sed -i "/$job/d" ./running_jobs
      fi
    done
    

You can try this out in Cloud Shell.

0 votes

One possible approach is to handle the job lifecycle management inside your app code, instead of relying on wrappers that do it for you automatically (those are the ones that block until the job completes).

From Running jobs:

To run a job programmatically:

  1. Start the job by calling the jobs.insert method using a unique job ID generated by your client code. The server generates a job ID for you if you omit it, but it is a best practice to generate the job ID on the client side to allow reliable retry of the jobs.insert call.

    When you call the jobs.insert method, include a job resource representation containing a child property that specifies the job type — load, query, extract, or copy.

  2. Check job status by calling jobs.get with the job ID and check the status.state value to learn the job status. When status.state is DONE, the job has stopped running; however, a DONE status does not mean that the job completed successfully, only that it is no longer running.

    Note: There are some wrapper functions that manage job status requests for you. For example, running jobs.query creates a job and periodically polls for DONE status for a specified period of time.

  3. Check for job success. If the job has a status.errorResult property, the job has failed. The status.errorResult property holds information describing what went wrong in a failed job. If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows in a load job. Non-fatal errors are returned in the job's status.errors list.
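
Mapped onto the Java client from the question, the step-3 success check looks roughly like this sketch (getError() corresponds to status.errorResult and getExecutionErrors() to status.errors):

    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobStatus;

    static void checkJobSuccess(Job job) {
      JobStatus status = job.getStatus();
      if (status.getError() != null) {
        // status.errorResult is present: the job failed.
      } else if (status.getExecutionErrors() != null
          && !status.getExecutionErrors().isEmpty()) {
        // The job finished successfully, but with non-fatal errors
        // (e.g. problems importing a few rows in a load job).
      }
    }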

Basically, instead of using a single blocking call to wait for the job to complete (queryJob.waitFor() in your case, the equivalent of the jobs.query wrapper mentioned in the quote), you launch the job, then make repeated calls to check the job's state (for example from delayed push queue tasks), and only start processing the results once it is complete.
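
A sketch of that polling pattern with App Engine push queues; the /bq-poll task URL is a hypothetical choice, and the BigQuery calls assume the google-cloud-bigquery client:

    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;
    import com.google.cloud.bigquery.*;

    // Handler body for a hypothetical /bq-poll?jobId=... push queue task.
    static void pollJob(BigQuery bigquery, String jobId) {
      Job job = bigquery.getJob(JobId.of(jobId));
      if (job != null && job.isDone()) {
        // DONE: check status.errorResult (getError()) and process the results.
        return;
      }
      // Not finished yet: re-enqueue this task with a delay instead of blocking.
      QueueFactory.getDefaultQueue().add(TaskOptions.Builder.withUrl("/bq-poll")
          .param("jobId", jobId)
          .countdownMillis(30000)); // poll again in ~30 seconds
    }

The countdownMillis delay keeps each task well under the request deadline, since every poll is its own short request rather than one long blocking wait.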

Probably also of interest would be the Introduction to BigQuery Jobs and Managing BigQuery Jobs.

Note: this answer is based solely on the docs; I am not yet a BigQuery user.