How to write array of string values from Pandas to Google Big Query

Question

Im currently trying to write a Pandas Dataframe (Python 3.x) into Google Big Query. The table has a column with dtype object that contains an array of string values.

sample of pandas table I aim to create a BQ table that maintains a nested table structure as below: sample of Big Query table with following schema: schema of Big Query table

Im using the google-cloud-bigquery library as that allows the df to convert to the Parquet format that per documentation supports nested array values:

code used:

client = bigquery.Client()
table_id = 'dataset.table'

job_config = bigquery.LoadJobConfig(
    schema = [
            bigquery.SchemaField('route_id', 'INTEGER'),
            bigquery.SchemaField('types', 'STRING', mode='REPEATED')
    ], 
    writeDisposition="WRITE_APPEND"
)

job = client.load_table_from_dataframe(
    df, 
    table_id, 
    job_config=job_config,
)

# Wait for the load job to complete.
job.result()

but unfortunately Im getting the following error message returned:

BadRequest: 400 Error while reading data, error message: Provided schema is not compatible with the file 'prod-scotty-76a528bc-407d-4224-8951-c8ff0c71faa1'. Field 'types' is specified as REPEATED in provided schema which does not match NULLABLE as specified in the file.

What has been tried so far:

used RECORD field type

but that caused the following error: https://github.com/googleapis/python-bigquery/issues/21

not shipping any schema in python at all (and allow Python/BQ to sort it out themselves)

surprisingly this works for the first iteration (CREATE_IF_NEEDED)creating a table in BQ that maintained the nested structure with the following schema automatically applied: auto-applied schema of BQ table, but fails if you try to append even the exact table again returning the same error as under 1.

Any advise or tips?

Albert Albesa Albert Albesa · Accepted Answer · 2020-03-16T09:01:24

There seems to be a mismatch between the schema in parquet and the interpreted schema for arrays when you load the data to BigQuery that has not been resolved yet.

I have been able to properly upload a dataframe with an array by using the open-source library pandas-gcp:

import pandas as pd
import pandas_gbq

d = {'nested_string': [['hi', 'keloke'], ['io', 'ready']], 'route_id': [83833, 4487]}
df = pd.DataFrame(data = d)

table_id = "dataset.table"
project_id = 'my_project'

pandas_gbq.to_gbq(
    df, table_id, project_id=project_id, if_exists='replace',
)

Other possible workarounds without 3rd party tools:

· Use Dataflow instead

· From the python file, save the dataframe as a csv in a Google Storage Bucket and have it ingested from BigQuery

Do you think any of this could work for you?

How to write array of string values from Pandas to Google Big Query

1 Answers