5
votes

UPDATE: I've narrowed the problem a bit, so I'm removing unnecessary code and examples:

UPDATE 2: After leaving the cron job running at 12-hour intervals for quite some time (each run ending in success, but with nothing written to BQ), we were surprised to discover that, after about a week, one of the cron jobs did successfully write to BigQuery, along with Stackdriver logs stating that "this request caused a new process to be started for your application (...)", as copied below. The following jobs again stopped writing. Now I'm wondering whether this is somehow connected to cached app state (with some expiration period) or to a credential expiration date, either of which might prevent further writes to BigQuery after the first one without producing errors.

Problem description:

I am trying to set up a cron job in App Engine (standard environment) to query from and write data back to BigQuery (the datasets are in the same project as the deployed app). The cron jobs execute successfully, but they only write to BigQuery on the first execution after deployment; afterwards they still report success but do not write anything.

The main difference I've found is in the Stackdriver logs: for executions which correctly write there are extra debug and info messages, while for subsequent ones there are no such messages:

2018-04-19 04:44:03.933 CEST
Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None) (/base/data/home/apps/e~<redacted>/lib/urllib3/util/retry.py:200)
2018-04-19 04:44:04.154 CEST
Making request: POST https://accounts.google.com/o/oauth2/token (/base/data/home/apps/e~<redacted>/lib/google/auth/transport/requests.py:117)
2018-04-19 04:44:04.160 CEST
Starting new HTTPS connection (1): accounts.google.com (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:824)
2018-04-19 04:44:04.329 CEST
https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:396)
2018-04-19 04:44:04.339 CEST
Starting new HTTPS connection (1): www.googleapis.com (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:824)
2018-04-19 04:44:04.802 CEST
https://www.googleapis.com:443 "POST /bigquery/v2/projects/<redacted>/jobs HTTP/1.1" 200 None (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:396)
2018-04-19 04:44:04.813 CEST
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

I've tried:

  • Adding the BigQuery DataOwner and User permissions to the default App Engine service account, but there was no effect.

  • There are mentions that the google.cloud library is not fully supported on standard App Engine, so I tried using OAuth2/httplib2/googleapiclient credentials to authenticate, but this is the first time I've tried that and I don't understand how to put the pieces together; without the google.cloud library I don't even know how to write a proper query for BQ (a rough sketch of that approach is included below, after this list).

  • Other credential setup approaches as suggested below, but connecting to BQ does not seem to be the issue; they all connect and write (once), the problem is only repeating the write within the already deployed App Engine app.
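
For reference, here is a minimal sketch of how those OAuth2/googleapiclient pieces could fit together, assuming Application Default Credentials resolve to the default App Engine service account (untested, names are placeholders):

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def run_query_apiclient(project_id='<redacted>'):
    # Application Default Credentials resolve to the default App Engine
    # service account when running on GAE.
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('bigquery', 'v2', credentials=credentials)
    body = {'query': "SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source",
            'useLegacySql': False}
    # Runs the query synchronously and returns the response with the rows.
    return service.jobs().query(projectId=project_id, body=body).execute()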

Below is the full implementation:

app.yaml:

runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /bigquerycron
  script: bigquerycron.app
  login: admin

libraries:
- name: ssl
  version: latest

env_variables:
  GAE_USE_SOCKETS_HTTPLIB : 'true'

bigquerycron.py:

from __future__ import absolute_import
from google.cloud import bigquery
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')          

def writeDataTest(dataset_id = '<redacted>',table_id='<redacted>'):
    client = bigquery.Client.from_service_account_json("credentials.json")
    job_config = bigquery.QueryJobConfig()
    table_ref = client.dataset(dataset_id).table(table_id)
    job_config.destination = table_ref
    job_config.write_disposition = 'WRITE_APPEND'

    query_job = client.query(
    """SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source""", job_config=job_config)

writeDataTest()

app = webapp2.WSGIApplication([
      ('/bigquerycron', MainPage),
], debug=True)

cron.yaml:

cron:
- url: /bigquerycron
  schedule: every 30 minutes
3
Are you running this on a compute engine or are you logging into the cloud shell every time you want to perform the operation? - Ben P
This is running as an app deployed with App Engine. For testing I run these scripts manually (executing the .py file) through Cloud Shell, and with that I have no issues; the problem only appears when it's scheduled with automated cron jobs. - Vaeqal
Have you tried scheduling with Crontab on the instance instead of Cron jobs, just to eliminate this as a potential issue? Also, you could explicitly add .json client credentials to your Python code to use a service account instead? - Ben P
I want to use cron jobs because that's just the first use case we want to implement with it. As for credentials I never worked with them before (any form of authentication really), could you modify the first bigquerycron.py from my post with an example so I could test it? - Vaeqal
I use client = bigquery.Client.from_service_account_json("my_file.json") in my automated Python script, where this file is in the local directory on the instance. Try using this in your code and adding in your .json credentials file. - Ben P

3 Answers

2
votes

Credentials were not the problem in this specific case; the issue was simply the placement of the function call, due to a misunderstanding of how App Engine works. The BigQuery function call should be moved inside the MainPage class definition. The fixed bigquerycron.py looks like this (only a single line of code is moved):

from __future__ import absolute_import
from google.cloud import bigquery
import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')          
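        # the BigQuery call now runs on every request, inside the handler,
        # instead of once at module load time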
        writeDataTest()

def writeDataTest(dataset_id = '<redacted>',table_id='<redacted>'):
    client = bigquery.Client.from_service_account_json("credentials.json")
    job_config = bigquery.QueryJobConfig()
    table_ref = client.dataset(dataset_id).table(table_id)
    job_config.destination = table_ref
    job_config.write_disposition = 'WRITE_APPEND'
    query_job = client.query(
    """SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source""", job_config=job_config)

app = webapp2.WSGIApplication([
      ('/bigquerycron', MainPage),
], debug=True)

The version in the OP indeed only writes to BigQuery once, when the App Engine app is loaded for the first time; all subsequent calls just execute the MainPage class, which in this case does nothing, as the actual BigQuery code sits outside of it.

Additionally, it would be beneficial to rewrite the app without the google-cloud-python library, which is not supported on GAE Standard (https://github.com/GoogleCloudPlatform/google-cloud-python/issues/1893). This is especially unfortunate since even the official BigQuery documentation for Python (https://cloud.google.com/bigquery/docs/) uses this library. There is a variety of workarounds for keeping it, however, including some mentioned in the linked GitHub issue and in Using gcloud-python in GAE; a similar workaround was used in this example.

But as mentioned, it would be best to use the dedicated Google API Client Library for Python: https://developers.google.com/api-client-library/python/
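
For example, the same write-append query from the fixed code above could be expressed with the API client library roughly as follows (a sketch only, assuming Application Default Credentials and the BigQuery v2 jobs.insert endpoint; the placeholder names match the code above):

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

def writeDataTest(project_id='<redacted>', dataset_id='<redacted>', table_id='<redacted>'):
    # Application Default Credentials pick up the App Engine service account.
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('bigquery', 'v2', credentials=credentials)
    job_body = {
        'configuration': {
            'query': {
                'query': "SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source",
                'useLegacySql': False,
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': dataset_id,
                    'tableId': table_id,
                },
                'writeDisposition': 'WRITE_APPEND',
            }
        }
    }
    # Inserts an asynchronous query job; BigQuery appends the result rows
    # to the destination table server-side.
    return service.jobs().insert(projectId=project_id, body=job_body).execute()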

0
votes

I suspect that if you remove the "login: admin" part of the app.yaml, it will work.

If that's the issue, please make sure you have the proper X-Appengine header check set up.

Here is some documentation for task queues and for cron jobs.
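
As a minimal sketch, that header check could look like this in the handler from the question (App Engine strips the X-Appengine-Cron header from external requests and sets it only for requests issued by cron):

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Reject anything that did not come from App Engine cron; this keeps
        # the handler cron-only without relying on login: admin.
        if self.request.headers.get('X-Appengine-Cron') != 'true':
            self.abort(403)
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')
        writeDataTest()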

0
votes

Although I am not sure of the reason, I think authorizing the App Engine service account is not enough for accessing BigQuery.

In order to authorize your app to access BigQuery, you can use either of these two methods:

  1. Within the app.yaml file, configure an environment variable pointing to a service account key file with a correct authorization configuration for BigQuery:

    env_variables:
      GOOGLE_APPLICATION_CREDENTIALS: '[YOURKEYFILE].json'
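
    With that variable set, the client can then be constructed without an explicit key path, for example (a sketch, assuming the google.cloud library is still used):

    from google.cloud import bigquery

    # Credentials are resolved from GOOGLE_APPLICATION_CREDENTIALS automatically.
    client = bigquery.Client()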

  2. Have your code fetch an authorized service account key from a bucket and then load it, with the help of the Cloud Storage client library. Since your runtime is Python, the code you could use is the following:

    # ...
    from google.cloud import storage
    # ...

    def download_key():
        """Downloads the key from the bucket."""
        storage_client = storage.Client()
        bucket = storage_client.get_bucket('YOURBUCKET')
        blob = bucket.blob('Keynameinthebucket.json')
        blob.download_to_filename('Keynameinyourapp.json')

    # ...
    # within the code:
    download_key()
    client = bigquery.Client.from_service_account_json('Keynameinyourapp.json')