I would like to export a 90 TB BigQuery table to Google Cloud Storage. According to the documentation, the BigQuery Storage API (beta) should be the way to go, because the other export methods are subject to export size quotas (e.g., ExtractBytesPerDay).
The table is date-partitioned, with each partition occupying ~300 GB. I have a Python AI Platform Notebook running on GCP, which processes the partitions in parallel using this script, adapted from the docs.
from google.cloud import bigquery_storage_v1
client = bigquery_storage_v1.BigQueryReadClient()
table = "projects/{}/datasets/{}/tables/{}".format(
"bigquery-public-data", "usa_names", "usa_1910_current"
) # I am using my private table instead of this one.
requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO
parent = "projects/{}".format(project_id)
session = client.create_read_session(
parent,
requested_session,
max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)
# The read stream contains blocks of Avro-encoded bytes. The rows() method
# uses the fastavro library to parse these blocks as an iterable of Python
# dictionaries.
rows = reader.rows(session)
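For what it's worth, the reading itself works fine; rows behaves like an iterable of Python dicts:

# Peek at one decoded record (a plain Python dict) from the stream.
first_row = next(iter(rows))
print(first_row)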
Is it possible to save data from the stream directly to Google Cloud Storage?
I tried saving the data as Avro files on my notebook instance using fastavro and later uploading them to GCS with Blob.upload_from_filename() (roughly as sketched below), but this process is very slow. I was hoping it would be possible to point the stream directly at my GCS bucket. I experimented with Blob.upload_from_file, but couldn't figure it out.
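For reference, this is roughly what my current (slow) approach looks like; the bucket name and local path are placeholders for my real ones:

import json
import fastavro
from google.cloud import storage

# Parse the Avro schema that is returned with the read session.
avro_schema = fastavro.parse_schema(json.loads(session.avro_schema.schema))

# Write the decoded rows to a local Avro file on the notebook VM...
local_path = "/tmp/partition.avro"  # placeholder path
with open(local_path, "wb") as out_file:
    fastavro.writer(out_file, avro_schema, reader.rows(session))

# ...then upload the finished file to GCS.
storage_client = storage.Client()
bucket = storage_client.bucket("my-bucket")  # placeholder bucket name
bucket.blob("exports/partition.avro").upload_from_filename(local_path)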
I cannot decode the whole stream into memory and use Blob.upload_from_string, because I don't have the ~300 GB of RAM that would require.
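Conceptually, what I am after is something like the sketch below, writing into the bucket as the rows arrive so that only a small buffer is ever held in memory. I don't know whether this is viable or efficient, and Blob.open("wb") is just my guess at how it might be done (it needs a reasonably recent google-cloud-storage release):

import json
import fastavro
from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.bucket("my-bucket").blob("exports/partition.avro")  # placeholders

# Same Avro schema as before, taken from the read session.
avro_schema = fastavro.parse_schema(json.loads(session.avro_schema.schema))

# Open a writable, file-like handle backed by a resumable upload and stream
# the decoded rows into it as they come off the read stream.
with blob.open("wb") as gcs_file:
    fastavro.writer(gcs_file, avro_schema, reader.rows(session))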
I have spent the last two days combing through the GCP documentation but couldn't find anything, so I would appreciate your help, preferably with a code snippet if at all possible. (If another file format is easier to work with, I am all for it.)
Thank you!