2 votes

I have a dataset in BigQuery with 100,000+ rows and 10 columns, and I'm continuously adding new data to it. I want to fetch the data that has not been processed yet, process it, and write it back to my table. Currently, I'm fetching it into a pandas DataFrame using the BigQuery Python library and processing it with pandas.

Now, I want to update the table with the newly pre-processed data. One way of doing it is to build a SQL statement and call the query method of the bigquery.Client() class, or to use a job like here.

from google.cloud import bigquery

bqclient = bigquery.Client(
    credentials=credentials,
    project=project_id,
)

query = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
bqclient.query(query).result()  # was query_string (undefined); .result() waits for the DML job to finish

But it doesn't make sense to create an UPDATE statement for each row.

Another way I found is the to_gbq function of the pandas-gbq package. The disadvantage of this is that it rewrites the whole table.
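
For reference, a minimal sketch of that approach, assuming a DataFrame df holding the processed rows and placeholder project/table names; if_exists='replace' is what makes it rewrite the entire table:

import pandas_gbq

# df is a hypothetical DataFrame holding the processed rows.
# if_exists='replace' drops and recreates `dataset.table`, so the
# whole table is rewritten rather than just the changed rows.
pandas_gbq.to_gbq(
    df,
    destination_table="dataset.table",
    project_id="my-project",  # placeholder
    if_exists="replace",
)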

Question: What is the best way of updating a BigQuery table from a pandas DataFrame?


1 Answer

2 votes

Google BigQuery is mainly intended for data analysis over data that is static, where you don't have to update values; its architecture is optimized for that kind of workload. Therefore, if you want to update the data, there are some options, but they are all heavy:

  1. The one you mentioned: running an UPDATE query for each row.
  2. Recreate the table using only the new values.
  3. Appending the new data with a different timestamp.
  4. Using partitioned tables [1] and, if possible, clustered tables [2]. This way, when you want to update the table, you can filter on the partitioning and clustering columns, so the update query scans much less data. You can also append the new data to a new partition, say the current day's (see the sketch after this list).
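
A minimal sketch of options 3 and 4 combined, assuming a DataFrame df of newly processed rows, the same credentials/project_id as above, a placeholder destination table name, and an ingestion-time day-partitioned table (load_table_from_dataframe also requires pyarrow):

from google.cloud import bigquery

bqclient = bigquery.Client(credentials=credentials, project=project_id)

job_config = bigquery.LoadJobConfig(
    # Append instead of overwriting; new rows land in today's partition.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
    ),
)

load_job = bqclient.load_table_from_dataframe(
    df, "dataset.table_partitioned", job_config=job_config
)
load_job.result()  # wait for the load job to finish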

If you are using the data for analytical purposes, maybe the best options are 2 and 3, but I always recommend using [1] and [2].
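
For completeness, a hedged example of setting up [1] and [2] together, with a hypothetical schema and table name; the table is partitioned on a date column and clustered on the column used in the UPDATE filter earlier:

from google.cloud import bigquery

bqclient = bigquery.Client(credentials=credentials, project=project_id)

# Hypothetical schema: partition on the date of processed_at and
# cluster on field_2 (the column used in the WHERE clause above),
# so updates filtered on these columns scan less data.
ddl = """
CREATE TABLE `dataset.table_partitioned`
(
  field_1 STRING,
  field_2 STRING,
  processed_at TIMESTAMP
)
PARTITION BY DATE(processed_at)
CLUSTER BY field_2
"""
bqclient.query(ddl).result()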

[1] https://cloud.google.com/bigquery/docs/querying-partitioned-tables

[2] https://cloud.google.com/bigquery/docs/clustered-tables