I have a pandas dataframe with one column containing a JSON object. I want to send the dataframe to a BigQuery table and map the JSON object to a column with a RECORD type, but it is mapped to a column with a STRING type instead. In an Azure Databricks notebook I was able to do this by using from_json with my own defined schema. For example:

from pyspark.sql.functions import from_json, col
df = df.withColumn("value", from_json(col("nestedJson"), jsonSchema))

But I am using a Jupyter notebook on GCP, where there is no from_json method. Any suggestions?

1 Answer

The following works for me.

Let's define a dummy dataframe with two columns, id and nestedJson:

import pandas as pd

# Dummy data: the nestedJson column holds Python dicts
d = {'id': [1, 2, 3],
     'nestedJson': [{'k1': 1, 'k2': 2}, {'k1': 3, 'k2': 4}, {'k1': 5, 'k2': 6}]}

df = pd.DataFrame(d)

So df looks like this:

   id          nestedJson
0   1  {'k1': 1, 'k2': 2}
1   2  {'k1': 3, 'k2': 4}
2   3  {'k1': 5, 'k2': 6}
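Note that the nestedJson column holds actual Python dicts. If your column contains JSON strings instead (a common reason it ends up as a STRING column in BigQuery), you can parse them first with json.loads, which plays roughly the role from_json did in Spark. A minimal sketch, assuming the same column name:

import json

# Parse JSON strings into dicts so that BigQuery's schema
# autodetection can map the column to a RECORD type.
df['nestedJson'] = df['nestedJson'].apply(
    lambda v: json.loads(v) if isinstance(v, str) else v
)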

Now we create a BigQuery client, set up the load job configuration (with schema autodetection enabled), and load the data from the dataframe, following the BigQuery documentation (https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe):

from google.cloud import bigquery

table_id = 'my_project.my_dataset.my_table'  # Change this to your destination table

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition="WRITE_TRUNCATE"
)

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the job to complete.
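One practical note: as far as I know, load_table_from_dataframe serializes the dataframe to Parquet under the hood, so you will likely need the pyarrow package installed alongside google-cloud-bigquery for this to work.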

Finally, you can check the job's result:

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns".format(
        table.num_rows, len(table.schema)
    )
)

Which should print:

Loaded 3 rows and 2 columns

Your BigQuery table should look like this:

Schema

Field name     Type     Mode
id             INTEGER  NULLABLE
nestedJson     RECORD   NULLABLE
nestedJson.k1  INTEGER  NULLABLE
nestedJson.k2  INTEGER  NULLABLE

Preview

Row  id  nestedJson.k1  nestedJson.k2
1    1   1              2
2    2   3              4
3    3   5              6
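Once loaded, the nested fields can be queried with dot notation. As a quick sanity check from the same client (reusing table_id from above; the field names follow the dummy data):

# Query the nested RECORD fields directly.
query = f"SELECT id, nestedJson.k1, nestedJson.k2 FROM `{table_id}`"
for row in client.query(query).result():
    print(row.id, row.k1, row.k2)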

Finally, it's worth noting that while schema autodetection works fine here, if you try to set the schema explicitly you'll probably run into this issue: https://github.com/googleapis/python-bigquery/issues/19
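For reference, an explicit schema for this table would look roughly like the sketch below, using nested SchemaField objects for the RECORD. Per the issue linked above, passing it to load_table_from_dataframe may fail, so treat this as illustrative rather than a working alternative:

# Sketch of an explicit schema with a nested RECORD field.
# May not work with load_table_from_dataframe (see linked issue).
explicit_schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField(
        "nestedJson", "RECORD",
        fields=[
            bigquery.SchemaField("k1", "INTEGER"),
            bigquery.SchemaField("k2", "INTEGER"),
        ],
    ),
]
job_config = bigquery.LoadJobConfig(
    schema=explicit_schema,
    write_disposition="WRITE_TRUNCATE",
)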