I have a pandas dataframe with one column containing a JSON object. I want to send the dataframe to a BigQuery table and map the JSON object to a column with a RECORD type, but it is mapped to a column with a STRING type instead. In an Azure Databricks notebook I was able to do this by using from_json with my own defined schema. For example:

from pyspark.sql.functions import from_json, col
df = df.withColumn("value", from_json(col("nestedJson"), jsonSchema))

But I am using a Jupyter notebook on GCP, where there is no from_json method. Any suggestions?

1 Answer

The following works for me.

Let's define a dummy dataframe with two columns, id and nestedJson:

import pandas as pd

# Dummy data: the nestedJson column holds Python dicts
d = {'id': [1, 2, 3],
     'nestedJson': [{'k1': 1, 'k2': 2}, {'k1': 3, 'k2': 4}, {'k1': 5, 'k2': 6}]}

df = pd.DataFrame(d)

So df looks like this:

   id          nestedJson
0   1  {'k1': 1, 'k2': 2}
1   2  {'k1': 3, 'k2': 4}
2   3  {'k1': 5, 'k2': 6}
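Note that the nestedJson column holds actual Python dicts. If your column contains JSON strings instead (a common reason it ends up as a STRING column in BigQuery), you can parse them first with json.loads, which plays roughly the role from_json did in Spark. A minimal sketch, assuming the same column name:

import json

# Parse JSON strings into dicts so that BigQuery's schema
# autodetection can map the column to a RECORD type.
df['nestedJson'] = df['nestedJson'].apply(
    lambda v: json.loads(v) if isinstance(v, str) else v
)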

Now we create a BigQuery client, set up the load job configuration (with schema autodetection enabled), and load the data from the dataframe, following the BigQuery documentation (https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe):

from google.cloud import bigquery

table_id = 'my_project.my_dataset.my_table'  # Change this to your destination table

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    write_disposition="WRITE_TRUNCATE"
)

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the job to complete.
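One practical note: as far as I know, load_table_from_dataframe serializes the dataframe to Parquet under the hood, so you will likely need the pyarrow package installed alongside google-cloud-bigquery for this to work.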

Finally, you can check the job's result:

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns".format(
        table.num_rows, len(table.schema)
    )
)

Which should print:

Loaded 3 rows and 2 columns

Your BigQuery table should look like this:

Schema

Field name     Type     Mode
id             INTEGER  NULLABLE
nestedJson     RECORD   NULLABLE
nestedJson.k1  INTEGER  NULLABLE
nestedJson.k2  INTEGER  NULLABLE

Preview

Row  id  nestedJson.k1  nestedJson.k2
1    1   1              2
2    2   3              4
3    3   5              6
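Once loaded, the nested fields can be queried with dot notation. As a quick sanity check from the same client (reusing table_id from above; the field names follow the dummy data):

# Query the nested RECORD fields directly.
query = f"SELECT id, nestedJson.k1, nestedJson.k2 FROM `{table_id}`"
for row in client.query(query).result():
    print(row.id, row.k1, row.k2)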

Finally, it's worth noting that while schema autodetection works fine here, if you try to set the schema explicitly you'll probably run into this issue: https://github.com/googleapis/python-bigquery/issues/19
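For reference, an explicit schema for this table would look roughly like the sketch below, using nested SchemaField objects for the RECORD. Per the issue linked above, passing it to load_table_from_dataframe may fail, so treat this as illustrative rather than a working alternative:

# Sketch of an explicit schema with a nested RECORD field.
# May not work with load_table_from_dataframe (see linked issue).
explicit_schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField(
        "nestedJson", "RECORD",
        fields=[
            bigquery.SchemaField("k1", "INTEGER"),
            bigquery.SchemaField("k2", "INTEGER"),
        ],
    ),
]
job_config = bigquery.LoadJobConfig(
    schema=explicit_schema,
    write_disposition="WRITE_TRUNCATE",
)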