
I'm creating a data lake in AWS based on Athena, and I would like to query data that I currently store in Google Analytics. As I understand it, I don't have access to the raw Analytics data, but I can export it to BigQuery and from there export it again to GCS (Google Cloud Storage). I know that I can create an automated process to export the data from Analytics to BigQuery.

How can I create (easily) the same export from BigQuery to GCS?

Also, what would be the easiest way to export all the historical data? I saw that I can do an export from the BigQuery console, but it exports data for a single day only, and this service has been running for a while now.

Once all the data is in GCS, I guess I can run an AWS Lambda to copy the data to my AWS account so I can query it.
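
Something like this minimal sketch is what I have in mind for the Lambda (the bucket names and the prefix are placeholders, and it assumes the function has credentials for both GCS and S3):

import boto3
from google.cloud import storage

GCS_BUCKET = "your-gcs-bucket"   # placeholder: bucket that receives the BigQuery export
S3_BUCKET = "your-s3-bucket"     # placeholder: bucket that Athena will query

def lambda_handler(event, context):
    # Assumes a GCP service account key is available to the Lambda
    gcs_client = storage.Client()
    s3_client = boto3.client("s3")

    # Copy every exported file under the prefix into S3, keeping the same object keys
    for blob in gcs_client.list_blobs(GCS_BUCKET, prefix="your-folder/"):
        s3_client.put_object(
            Bucket=S3_BUCKET,
            Key=blob.name,
            Body=blob.download_as_bytes(),
        )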


1 Answer


According to the documentation, you are not allowed to export data from more than one table in a single export job.

If you need to automate the export, I suggest that you use a Python script like the one below.

Before using this script, keep in mind that you need to install the BigQuery client library for Python. You can do that by running pip install google-cloud-bigquery in your terminal. Also keep in mind that this code assumes you want to export all the tables in the given dataset. If the dataset contains tables other than the ones you want to export, you'll need to filter for the correct tables.

from google.cloud import bigquery as bq

# Defining the variables below to make the code cleaner
project_id = "your-project-id"
dataset_id = "your-dataset-id"

# Starting client
client = bq.Client(project=project_id)

# Creating a reference to your dataset
dataset_ref = bq.DatasetReference(project_id, dataset_id)

# Getting the list of tables in your dataset
t = client.list_tables(dataset_ref)


# The loop below will repeat for all the tables listed in the dataset
# The destination is in the format gs://<your-bucket>/<some-folder>/filename
# The filename is in the format export_<table_name>_<hash>.csv
# This hash is created by the wildcard (*). The wildcard is needed when 
# your export is likely to generate a file bigger than 1 GB


for table in t:
    table_id = table.table_id
    table_ref = dataset_ref.table(table_id)
    destination = "gs://your-bucket/your-folder/" + table_id + "/export_" + table_id + "_*.csv"
    extract_job = client.extract_table(table_ref, destination, location="US")
    # extract_table starts an asynchronous job; wait for it to finish
    # so failures surface before moving on to the next table
    extract_job.result()
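
Each call to extract_table starts an asynchronous job, and calling result() on it waits for that job to finish, so any errors show up right away. Since the Analytics export typically creates one table per day in the dataset, looping over every table like this should also cover your historical data in a single run.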