2 votes

I am using the BigQuery Python client library to export data from BigQuery tables into GCS in CSV format.

I have given a wildcard pattern in the destination URI, assuming some tables can be larger than 1 GB.

Sometimes, even though the table is only a few MB, the export creates multiple files; other times it creates just one file.

Is there a logic behind this?

My export workflow is the following:

    project = bq_project
    dataset_id = bq_dataset_id
    table_id = bq_table_id
    bucket_name = bq_bucket_name
    workflow_name = workflow_nm
    csv_file_nm = workflow_nm + "/" + csv_file_prefix_in_gcs + '*'
    client = bigquery.Client()
    destination_uri = "gs://{}/{}".format(bucket_name, csv_file_nm)
    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    destination_table = client.get_table(dataset_ref.table(table_id))
    configuration = bigquery.job.ExtractJobConfig()
    configuration.destination_format = 'CSV'
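The snippet above builds the destination URI by plain string concatenation, with the wildcard always appended. A small helper makes that choice explicit; `build_destination_uri` and its parameters are hypothetical names for illustration, and the `extract_table` call in the comment assumes the `client` and `configuration` objects are set up as above:

```python
def build_destination_uri(bucket_name, workflow_nm, csv_file_prefix, shard=True):
    """Build the GCS destination URI for a BigQuery extract job.

    Hypothetical helper: when shard is True a trailing '*' is appended,
    which lets BigQuery split the export into multiple files (and is
    required when the exported data exceeds 1 GB).
    """
    csv_file_nm = workflow_nm + "/" + csv_file_prefix + ("*" if shard else "")
    return "gs://{}/{}".format(bucket_name, csv_file_nm)

# The extract job itself would then be started with the real client API, e.g.:
# client.extract_table(table_ref, destination_uri, job_config=configuration)
```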
Which wildcard are you using? gs://my-bucket/file-name-*.json or gs://my-bucket/file-name-<worker number>-*.json? - Chris32
csv_file_nm=workflow_nm+"/"+csv_file_prefix_in_gcs - Sreekanth

1 Answer

1 vote

I think this is intended behavior of the export. The BigQuery export documentation states the following:

When you export data to multiple files, the size of the files will vary.

This matches the behavior you are seeing: with a wildcard URI, BigQuery decides how many files to write, and the count and sizes are not deterministic even for small tables.
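Since the wildcard is only required for exports larger than 1 GB, one way to keep small tables in a single file is to pick the URI based on the table size the API reports. This is a sketch under that assumption; `pick_destination_uri` is a hypothetical helper, not part of the BigQuery API:

```python
ONE_GB = 1 * 1024 ** 3  # documented single-URI export limit

def pick_destination_uri(num_bytes, single_uri, wildcard_uri):
    """Return a single-file URI for small tables, a wildcard URI otherwise.

    Hypothetical helper: num_bytes would come from
    client.get_table(...).num_bytes in the workflow above.
    """
    return single_uri if num_bytes < ONE_GB else wildcard_uri
```

Note that omitting the wildcard is the only way to guarantee one output file; with a wildcard URI, BigQuery may still shard a sub-1 GB export, which is exactly what you observed.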