
I am running a Python script that exports a table called newdataset.newtable2 from BigQuery to my app's Google Cloud Storage bucket.

Here is my code:

import json
import uuid

from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials

scope = ["https://www.googleapis.com/auth/bigquery"]
project_id = 'txxxxxxx9'
dataset_id = 'newdataset'
table_id = 'newtable2'

# Load the service-account key and build credentials from it.
with open('/home/xxxxxxx/Dropbox/access_keys/google_storage/xxxxxxxx.json') as auth_file:
    key = json.load(auth_file)

client_email = key['client_email']
pv_key = key['private_key']
credentials = SignedJwtAssertionCredentials(client_email, pv_key, scope=scope)

bigquery_service = build('bigquery', 'v2', credentials=credentials)

# Configure an extract (export) job from the table to Cloud Storage.
job_data = {
    'jobReference': {
        'projectId': project_id,
        'jobId': str(uuid.uuid4())
    },
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': project_id,
                'datasetId': dataset_id,
                'tableId': table_id
            },
            'destinationUris': ['gs://xxxxxxx/test.csv'],
            'destinationFormat': 'CSV'
        }
    }
}

# Insert the job; insert() returns immediately and the extract runs asynchronously.
query_job = bigquery_service.jobs().insert(projectId=project_id, body=job_data).execute()

I am astonished by how slow this request is. My table is 300 MB, yet the request takes 5 minutes. Note that the job does not appear in the jobs section of my BigQuery UI, but after 5 minutes the .csv can be found in my bucket and looks fine.
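For reference, here is a minimal sketch of how the job can be watched from the script rather than the UI, by polling jobs().get() with the jobId generated above (the status fields follow the BigQuery v2 API):

import time

# Poll the extract job until it reaches the DONE state.
job_id = job_data['jobReference']['jobId']
while True:
    job = bigquery_service.jobs().get(projectId=project_id, jobId=job_id).execute()
    if job['status']['state'] == 'DONE':
        # errorResult is only present when the job failed.
        if 'errorResult' in job['status']:
            raise RuntimeError(job['status']['errorResult'])
        break
    time.sleep(5)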

In Redshift and S3, such a request would take no more than 5 seconds. My question: am I doing this the right way, or am I missing something?

If my code is correct, can someone tell me why this basic task takes so long?

Note: I am using a free account for now (not upgraded).

How many times did you run it? Is 5 minutes the average, or was that just once? - Pentium10
I tried several times, changing the name of the output file on GCS, and 5 minutes is the average time. Even when I do it from the console, it takes 5 minutes; it is never under 5 minutes, in fact. Note that my bucket may be in europe-west; I am not sure how to check where it is. - sweeeeeet
I think you should file a performance issue ticket at code.google.com/p/google-bigquery; please mention your project and provide details so the engineers can identify your problem job. - Pentium10

1 Answer


The way you've formulated your request, it writes a single 300 MB CSV file from a single worker, which is going to be fairly slow. (Five minutes is still longer than I'd expect, but not unreasonable.)

If you use a glob pattern (e.g. gs://xxxxxxx/test*.csv) in your destination URI, the extract can be sharded across workers in parallel and should be much faster.
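For example, a sketch that reuses the job_data from the question, changing only the destination URI (the bucket name is the placeholder from above):

# With a wildcard, BigQuery shards the export across multiple workers,
# writing numbered files such as test000000000000.csv, test000000000001.csv, ...
job_data['configuration']['extract']['destinationUris'] = ['gs://xxxxxxx/test*.csv']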