I am working on exporting a large dataset from BigQuery to Google Cloud Storage in a compressed format. In Google Cloud Storage I have a file size limitation (maximum 1 GB per file). Therefore I am using split and compression techniques to split the data while exporting. The sample code is as follows:
import logging

from google.cloud import bigquery
from google.cloud import storage

# project, dataset_id, table_id and bucket_name are defined elsewhere
bigquery_client = bigquery.Client()
storage_client = storage.Client()

gcs_destination_uri = 'gs://{}/{}'.format(bucket_name, 'wikipedia-*.csv.gz')
gcs_bucket = storage_client.get_bucket(bucket_name)

# Job config: export as GZIP-compressed CSV
job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP


def bigquery_datalake_load():
    dataset_ref = bigquery_client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    table = bigquery_client.get_table(table_ref)  # API request
    row_count = table.num_rows

    extract_job = bigquery_client.extract_table(
        table_ref,
        gcs_destination_uri,
        location='US',
        job_config=job_config)  # API request
    logging.info('BigQuery extract started. Waiting for the job to complete.')
    extract_job.result()  # Waits for the job to complete.

    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, gcs_destination_uri))
This code splits the large dataset and compresses it into the .gz format, but it returns multiple compressed files whose sizes range between roughly 40 MB and 70 MB.

I am trying to generate compressed files of about 1 GB each. Is there any way to get this done?
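For reference, this is a minimal sketch of how the sizes of the exported shards can be checked (it reuses gcs_bucket from the snippet above, and the 'wikipedia-' prefix matches the wildcard in the destination URI):

# List the exported shards and print their sizes in MB
for blob in gcs_bucket.list_blobs(prefix='wikipedia-'):
    print('{}: {:.1f} MB'.format(blob.name, blob.size / (1024 * 1024)))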