3 votes

I am currently exporting data from BigQuery to GCS buckets. I am doing this programmatically with the following extract job:

# bigquery_service is assumed to be an authorized BigQuery v2 client,
# e.g. built with googleapiclient.discovery.build('bigquery', 'v2', credentials=credentials)
query_request = bigquery_service.jobs()

DATASET_NAME = "#######"
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'

DESTINATION_PATH = 'gs://bucketname/foldername/'

query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            # Single wildcard URI: BigQuery shards the export into
            # numbered files matching this pattern.
            'destinationUris': [DESTINATION_PATH + 'my-files-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': False,
            'compression': 'GZIP'
        }
    }
}

# constants.PROJECT_NUMBER is assumed to be defined elsewhere
query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()

Since there is a constraint that only 1GB per file can be exported to GCS, I used the single wildcard URI (https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple). This splits the output into multiple smaller parts, and each part is gzipped as well.

My question: can I control the sizes of the split files? For example, if I export a 14GB table to GCS, it will be split into 14 files of 1GB each. Is there a way to change that 1GB to another size (smaller than 1GB before gzipping)? I checked the various parameters available for modifying the configuration.extract object (refer: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract).


1 Answer

3 votes

If you specify multiple URI patterns, the data will be sharded between them. So if you used, say, 28 URI patterns, each shard would be about half a GB. You'd end up with a second, zero-size file for each pattern, since this feature is really meant for MapReduce jobs, but it's one way to accomplish what you want.
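As a rough sketch (not from the original post), the extract configuration with multiple wildcard URI patterns might look like the following. NUM_PATTERNS, the 'my-files' prefix, and the placeholder IDs are assumptions; each pattern must contain exactly one '*' wildcard.

# Hypothetical sketch: shard the export across several single-wildcard URI
# patterns; BigQuery splits the data roughly evenly between the patterns.
NUM_PATTERNS = 28  # e.g. ~14GB table / 28 patterns ≈ 0.5GB per shard

destination_uris = [
    DESTINATION_PATH + 'my-files-{}-*.gz'.format(i)
    for i in range(NUM_PATTERNS)
]

extract_config = {
    'sourceTable': {
        'projectId': PROJECT_ID,
        'datasetId': DATASET_ID,
        'tableId': '#####',
    },
    'destinationUris': destination_uris,  # one shard per pattern
    'destinationFormat': 'CSV',
    'printHeader': False,
    'compression': 'GZIP'
}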

More info here (see the Multiple Wildcard URIs section): Exporting Data From BigQuery (https://cloud.google.com/bigquery/exporting-data-from-bigquery)