I am currently exporting data from BigQuery to GCS buckets. I am doing this programmatically by inserting an extract job with the following code:
query_request = bigquery_service.jobs()

DATASET_NAME = "#######"
PROJECT_ID = '#####'
DATASET_ID = 'DestinationTables'
DESTINATION_PATH = 'gs://bucketname/foldername/'

query_data = {
    'projectId': '#####',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': PROJECT_ID,
                'datasetId': DATASET_ID,
                'tableId': '#####',
            },
            # Single wildcard URI: the '*' is expanded into a shard number.
            'destinationUris': [DESTINATION_PATH + 'my-files' + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': False,  # boolean per the API reference
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()
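For completeness, the inserted export job can be polled until it finishes; the snippet below is only a minimal sketch, assuming the same bigquery_service and constants.PROJECT_NUMBER used above:

    import time

    # Sketch only: wait for the extract job inserted above to reach DONE.
    job_id = query_response['jobReference']['jobId']
    while True:
        job = bigquery_service.jobs().get(projectId=constants.PROJECT_NUMBER,
                                          jobId=job_id).execute()
        if job['status']['state'] == 'DONE':
            if 'errorResult' in job['status']:
                raise RuntimeError(job['status']['errorResult'])
            break
        time.sleep(5)  # poll every few seconds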
Since there is a constraint that at most 1 GB per file can be exported to GCS, I used the single wildcard URI (https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple). This splits the output into multiple smaller files, and each of the parts is gzipped as well.
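For context, the wildcard is expanded into sequentially numbered shards (e.g. my-files-000000000000.gz, my-files-000000000001.gz, and so on). A minimal sketch to list the resulting parts, assuming the bucket name 'bucketname' and prefix 'foldername/my-files' used above and the same discovery-based client style:

    from googleapiclient import discovery

    # Sketch only: list the gzipped shards the extract job wrote.
    storage_service = discovery.build('storage', 'v1')  # GCS JSON API
    objects = storage_service.objects().list(
        bucket='bucketname', prefix='foldername/my-files').execute()
    for obj in objects.get('items', []):
        print(obj['name'], obj['size'])  # object name and size in bytes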
My question: can I control the sizes of the split files? For example, if I export a 14 GB table to GCS, it will be split into fourteen 1 GB files. Is there a way to change that 1 GB to a different size (smaller than 1 GB before gzipping)? I checked the various parameters available on the configuration.extract object (refer: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.extract) but could not find one that controls the split size.