4 votes

I am working on exporting a large dataset from BigQuery to Google Cloud Storage in a compressed format. In Google Cloud Storage I have a file size limitation (maximum size of 1 GB per file). Therefore I am using split and compression techniques to split the data while exporting. The sample code is as follows:

from google.cloud import bigquery, storage
import logging

# Clients; bucket_name, project, dataset_id and table_id are assumed to be defined elsewhere
bigquery_client = bigquery.Client()
storage_client = storage.Client()

gcs_destination_uri = 'gs://{}/{}'.format(bucket_name, 'wikipedia-*.csv.gz')
gcs_bucket = storage_client.get_bucket(bucket_name)

# Job config: export as GZIP-compressed CSV shards
job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP


def bigquery_datalake_load():
    dataset_ref = bigquery_client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    table = bigquery_client.get_table(table_ref)  # API request
    row_count = table.num_rows

    extract_job = bigquery_client.extract_table(
        table_ref,
        gcs_destination_uri,
        location='US',
        job_config=job_config)  # API request
    logging.info('BigQuery extract started... Wait for the job to complete.')
    extract_job.result()  # Waits for the job to complete.

    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, gcs_destination_uri))

This code splits the large dataset and compresses it into .gz format, but it returns multiple compressed files whose sizes range between 40 MB and 70 MB.

I am trying to generate compressed files of about 1 GB each. Is there any way to get this done?


3 Answers

3 votes

Unfortunately, no - Google adjusts it by itself, and there is no option to specify the size. I believe this is because of the size of the uncompressed data: each BigQuery worker produces one file, and it is impossible to produce one file from multiple workers.
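For reference, a minimal sketch (reusing the bucket and the wikipedia- prefix from the question) to list the shards the extract job produced and confirm that their sizes vary:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)  # bucket_name as in the question

# List the exported shards and print each file's size in MB
for blob in bucket.list_blobs(prefix='wikipedia-'):
    print('{}: {:.1f} MB'.format(blob.name, blob.size / 1024 / 1024))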

0 votes

I think it is possible. You need to know the total size of your export; knowing that, you can split the results using multiple wildcard URIs. {1}

For example, if your export is 10 GB, you can define 10 wildcard URIs and the outputs will be approximately 1 GB each.
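For illustration, a minimal sketch of what that could look like with the Python client, assuming the same bucket_name, table_ref, bigquery_client and GZIP job_config as in the question; extract_table accepts a list of destination URIs, each containing a single wildcard:

# Multiple wildcard URIs: BigQuery partitions the export across these patterns.
# bucket_name, table_ref, bigquery_client and job_config are taken from the question.
destination_uris = [
    'gs://{}/wikipedia-{}-*.csv.gz'.format(bucket_name, i)
    for i in range(10)  # roughly one pattern per GB of export, as suggested above
]

extract_job = bigquery_client.extract_table(
    table_ref,
    destination_uris,
    location='US',
    job_config=job_config)
extract_job.result()  # Waits for the job to complete.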

You have a similar question answered here: {2}

{1}: https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files

{2}: Exporting data to GCS from BigQuery - Split file size control

0 votes

As other members commented, it is not possible. More than having to deal with a Google-specific algorithm, this comes down to the question of whether a file's compressed size can be calculated before actually compressing it, and the answer is no.

Because all compression techniques make use of some sort of dictionary, the final size of any file is known only after all symbols have been translated, which means the compression has already been done. There is another post on SO in which this is discussed, and this article explains some fundamentals of compression.
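As a quick toy illustration of this point (unrelated to the BigQuery export itself), two inputs of identical uncompressed size can produce very different compressed sizes, so the output size is only known after the compression has run:

import gzip
import os

# Two inputs of identical uncompressed size (10 MB each)
repetitive = b'a' * 10_000_000        # highly compressible
random_ish = os.urandom(10_000_000)   # essentially incompressible

print(len(gzip.compress(repetitive)))  # a few kilobytes
print(len(gzip.compress(random_ish)))  # roughly the original 10 MB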

Files could be "padded" to make them a uniform, larger size, but that would defeat the purpose of compression (which is saving space).