
I am trying to export table data from BigQuery to buckets created in Google Cloud Storage.

When I export the table in BigQuery to GCS with a single wildcard URI, it automatically splits the table into multiple sharded files (around 368 MB each), which land in the designated GCS bucket.

Here is the command:

bq --nosync extract --destination_format=CSV '<bq table>' 'gs://<gcs_bucket>/*.csv'

The file size and number of files remain the same (around 368 MB per file) even when I use multiple URIs:

bq --nosync extract --destination_format=CSV '<bq table>' 'gs://<gcs_bucket>/1-*.csv','gs://<gcs_bucket>/2-*.csv','gs://<gcs_bucket>/3-*.csv','gs://<gcs_bucket>/4-*.csv','gs://<gcs_bucket>/5-*.csv'

I am trying to figure out how to use the multiple-URIs option to reduce the file size.


1 Answer


I believe BigQuery does not guarantee the size of the files it produces, so what you observed is expected: the file size may not change whether or not multiple wildcard URIs are specified.
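If the goal is simply fewer bytes per file, one option worth considering is the extract job's compression setting. This is a sketch reusing the placeholder names from the question; `--compression=GZIP` shrinks each shard on disk but does not change how BigQuery decides the number of shards:

```shell
# Placeholder table and bucket names from the question.
# GZIP-compress each output shard; shard count is still chosen by BigQuery.
bq --nosync extract --destination_format=CSV --compression=GZIP \
  '<bq table>' 'gs://<gcs_bucket>/*.csv.gz'
```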

The common use case for multiple wildcard URIs is to tell BigQuery to distribute the output files evenly across the N patterns, so that each output URI pattern can be fed to a downstream worker.
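That fan-out pattern can be sketched as follows. This is a minimal illustration with a hypothetical bucket name and worker mapping, not BigQuery's own API: it just builds the N wildcard patterns you would pass to the extract job and assigns one pattern per worker.

```python
def make_uri_patterns(bucket: str, n: int) -> list[str]:
    """Build N wildcard URIs; BigQuery distributes output shards across them."""
    return [f"gs://{bucket}/{i}-*.csv" for i in range(1, n + 1)]

# Hypothetical bucket name; matches the 1-*.csv ... 5-*.csv shape in the question.
patterns = make_uri_patterns("my_bucket", 5)

# The comma-joined list is what you would pass as the destination URIs.
print(",".join(patterns))

# Each downstream worker then reads only its own pattern
# (e.g. via `gsutil ls <pattern>` or a storage client).
work_assignments = {f"worker-{i}": p for i, p in enumerate(patterns, 1)}
print(work_assignments["worker-1"])
```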