4 votes

When running locally, my Jupyter notebook is able to reference Google BigQuery like so:

%%bigquery some_bq_table

SELECT *
FROM
  `some_bq_dataset.some_bq_table` 

Later in my notebook I can then reference some_bq_table as a pandas DataFrame, as exemplified here: https://cloud.google.com/bigquery/docs/visualize-jupyter
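For context, this only works after loading the BigQuery cell magic, and the query result is then available as a regular pandas DataFrame. A minimal sketch of how I use it locally (the table name is just a placeholder):

# load the BigQuery cell magic once per notebook
%load_ext google.cloud.bigquery

# ... run the %%bigquery cell shown above ...

# the query result is now a pandas DataFrame named some_bq_table
some_bq_table.head()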

I want to run my notebook on AWS SageMaker to test a few things. To authenticate with BigQuery, it seems the only two options are using a service account on GCP (or locally) or pointing the SDK to a credentials JSON via an env var (as explained here: https://cloud.google.com/docs/authentication/getting-started).

For example:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

Is there an easy way to connect to BigQuery from SageMaker? My best idea right now is to download the JSON from somewhere to the SageMaker instance and then set the env var from the Python code.

For example, I would do this:

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/Downloads/[FILE_NAME].json"

However, this isn't very secure: I don't like the idea of downloading my credentials JSON onto a SageMaker instance (it means I would have to upload the credentials to some private S3 bucket and then store them on the SageMaker instance). Not the end of the world, but I'd rather avoid it.
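For completeness, the fallback I have in mind would look roughly like this (bucket name, key and local path are just placeholders):

import os
import boto3

# download the service account JSON from a private bucket (placeholder names)
s3 = boto3.client('s3')
local_path = '/home/ec2-user/SageMaker/service_credentials.json'
s3.download_file('my-private-bucket', 'gcp/service_credentials.json', local_path)

# point the Google SDK at the downloaded file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = local_path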

Any ideas?


1 Answer

7 votes

As you mentioned, GCP currently authenticates using service accounts, credentials JSON files and API tokens. Instead of storing the credentials in an S3 bucket, you can consider using AWS Secrets Manager or AWS Systems Manager Parameter Store to store the GCP credentials and then fetch them in the Jupyter notebook. This way the credentials stay secured, and the credentials file is only created from the secret/parameter when it is actually needed.
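To get the credentials into Parameter Store in the first place, you can upload the JSON once from a trusted machine (outside SageMaker), for example like this (the parameter name my_gcp_credentials is just an example and matches the code below):

import boto3

ssm = boto3.client('ssm')
# store the whole service account JSON as an encrypted SecureString
with open('service_credentials.json') as f:
    ssm.put_parameter(
        Name='my_gcp_credentials',
        Value=f.read(),
        Type='SecureString',
        Overwrite=True
    )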

Here is sample code I used previously to connect to BigQuery from a SageMaker instance, fetching the credentials from Parameter Store.

import os
import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

def get_gcp_credentials_from_ssm(param_name):
    # read credentials from SSM parameter store
    ssm = boto3.client('ssm')
    # Get the requested parameter
    response = ssm.get_parameters(Names=[param_name], WithDecryption=True)
    # Store the credentials in a variable
    gcp_credentials = response['Parameters'][0]['Value']
    # save credentials temporarily to a file (make sure the directory exists first)
    credentials_file = '/tmp/.gcp/service_credentials.json'
    os.makedirs(os.path.dirname(credentials_file), exist_ok=True)
    with open(credentials_file, 'w') as outfile:
        json.dump(json.loads(gcp_credentials), outfile)
    # create google.auth.credentials.Credentials to use for queries 
    credentials = service_account.Credentials.from_service_account_file(credentials_file)
    # remove temporary file
    if os.path.exists(credentials_file):
        os.remove(credentials_file)
    return credentials

# this will set the context credentials to use for queries performed in jupyter 
# using bigquery cell magic
magics.context.credentials = get_gcp_credentials_from_ssm('my_gcp_credentials')
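If you prefer Secrets Manager over Parameter Store, a roughly equivalent version (the secret name is again just an example) can avoid the temporary file entirely by building the credentials object from the parsed JSON:

import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

def get_gcp_credentials_from_secrets_manager(secret_name):
    # fetch the service account JSON stored as a secret
    secretsmanager = boto3.client('secretsmanager')
    secret = secretsmanager.get_secret_value(SecretId=secret_name)
    service_account_info = json.loads(secret['SecretString'])
    # build credentials directly from the dict, nothing written to disk
    return service_account.Credentials.from_service_account_info(service_account_info)

magics.context.credentials = get_gcp_credentials_from_secrets_manager('my_gcp_credentials')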

Please note that the SageMaker execution role should have access to SSM (or Secrets Manager, if you go that route), and the instance of course needs network connectivity to reach GCP. I am not sure if this is the best way though. Hope someone has a better way.