0
votes

I am using the BigQuery client libraries to perform a data ETL jpb, then load data back into BigQuery.

I'd like to overwrite the destination table every time, but currently my code is appending new data to the table every time it is run. I've read the documentation on job_config, and I have used this to set parameters for queries, but I can't figure out how to set a write disposition for the query.

Here is what I have tried so far:

roc_df = pd.DataFrame(roc_score)

job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

dataset_ref = client.dataset('Customers')
table_ref = dataset_ref.table('propensity_scores_test')

client.load_table_from_dataframe(roc_df, table_ref, job_config=job_config).result()

And I also tried this format:

query_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.job.WriteDisposition('WRITE_TRUNCATE')
    ]
)

But both are currently returning the error:

BadRequest: 400 POST https://www.googleapis.com/upload/bigquery/v2/projects/my_project/jobs?uploadType=resumable: Required parameter is missing

Hoe can I write my data out and replace the table each time?

1
Are you trying to write/truncate the table from the result of query or a load job?Graham Polley
I think this question might help youTamir Klein
@GrahamPolley I'm not sure! I'm loading data in using train = client.query(training_query).to_dataframe() without any config set - I only need to use a config later, to set the write rules, perhaps this is confusing things?Ben P

1 Answers

3
votes

The load_table_from_dataframe method uses a LoadJobConfig. Here's a working snippet of code:

from google.cloud import bigquery
import pandas as pd

roc_df = pd.DataFrame([{"firstName": "Foo", "lastName": "Bar"}])

client = bigquery.Client()

dataset_ref = client.dataset('my_dataset')
table_ref = dataset_ref.table('my_table')

job_config = bigquery.job.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

client.load_table_from_dataframe(roc_df, table_ref, job_config=job_config)

The only change with your code would be:

job_config = bigquery.job.LoadJobConfig()