3
votes

Im currently trying to write a Pandas Dataframe (Python 3.x) into Google Big Query. The table has a column with dtype object that contains an array of string values.

sample of pandas table I aim to create a BQ table that maintains a nested table structure as below: sample of Big Query table with following schema: schema of Big Query table

Im using the google-cloud-bigquery library as that allows the df to convert to the Parquet format that per documentation supports nested array values:

code used:

client = bigquery.Client()
table_id = 'dataset.table'

job_config = bigquery.LoadJobConfig(
    schema = [
            bigquery.SchemaField('route_id', 'INTEGER'),
            bigquery.SchemaField('types', 'STRING', mode='REPEATED')
    ], 
    writeDisposition="WRITE_APPEND"
)

job = client.load_table_from_dataframe(
    df, 
    table_id, 
    job_config=job_config,
)

# Wait for the load job to complete.
job.result()

but unfortunately Im getting the following error message returned:

BadRequest: 400 Error while reading data, error message: Provided schema is not compatible with the file 'prod-scotty-76a528bc-407d-4224-8951-c8ff0c71faa1'. Field 'types' is specified as REPEATED in provided schema which does not match NULLABLE as specified in the file.

What has been tried so far:

  1. used RECORD field type

but that caused the following error: https://github.com/googleapis/python-bigquery/issues/21

  1. not shipping any schema in python at all (and allow Python/BQ to sort it out themselves)

surprisingly this works for the first iteration (CREATE_IF_NEEDED)creating a table in BQ that maintained the nested structure with the following schema automatically applied: auto-applied schema of BQ table, but fails if you try to append even the exact table again returning the same error as under 1.

Any advise or tips?

1

1 Answers

0
votes

There seems to be a mismatch between the schema in parquet and the interpreted schema for arrays when you load the data to BigQuery that has not been resolved yet.

I have been able to properly upload a dataframe with an array by using the open-source library pandas-gcp:

import pandas as pd
import pandas_gbq

d = {'nested_string': [['hi', 'keloke'], ['io', 'ready']], 'route_id': [83833, 4487]}
df = pd.DataFrame(data = d)

table_id = "dataset.table"
project_id = 'my_project'

pandas_gbq.to_gbq(
    df, table_id, project_id=project_id, if_exists='replace',
)

Other possible workarounds without 3rd party tools:

· Use Dataflow instead

· From the python file, save the dataframe as a csv in a Google Storage Bucket and have it ingested from BigQuery

Do you think any of this could work for you?