
I am just getting started with the Python BigQuery API (https://github.com/GoogleCloudPlatform/google-cloud-python/tree/master/bigquery) after briefly trying out pandas-gbq (https://github.com/pydata/pandas-gbq) and realizing that pandas-gbq does not support the RECORD type, i.e. no nested fields.

Now I am trying to upload nested data to BigQuery. I managed to create the table with the respective schema, but I am struggling with the upload of the JSON data.

from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client()  # assumes credentials and project are already configured

SCHEMA = [
    SchemaField('full_name', 'STRING', mode='REQUIRED'),
    SchemaField('age', 'INTEGER', mode='REQUIRED'),
    SchemaField('address', 'RECORD', mode='REPEATED', fields=(
        SchemaField('test', 'STRING', mode='NULLABLE'),
        SchemaField('second', 'STRING', mode='NULLABLE'),
    )),
]

table_ref = client.dataset('TestApartments').table('Test2')

table = bigquery.Table(table_ref, schema=SCHEMA)
table = client.create_table(table)

When trying to upload a very simple JSON file to BigQuery, I get a rather ambiguous error:

400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details.

Besides the fact that it makes me kind of sad that it's giving up on me :), that error description obviously does not really help. Please find below how I try to upload the JSON, together with the sample data.

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

with open('testjson.json', 'rb') as source_file:
    job = client.load_table_from_file(
        source_file,
        table_ref,
        location='US',  # Must match the destination dataset location.
        job_config=job_config)  # API request

job.result()  # Waits for table load to complete.

print('Loaded {} rows into {}:{}.'.format(
    job.output_rows, 'TestApartments', 'Test2'))

This is my JSON object:

"[{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}]"

A JSON example would be wonderful, as this seems to be the only way to upload nested data, if I am not mistaken.
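For reference, a newline-delimited JSON file matching a schema like this can be generated with Python's standard json module (a sketch; the sample rows are invented to fit the schema above):

```python
import json

# Hypothetical rows matching the schema: full_name, age, and a
# repeated RECORD "address" with fields "test" and "second".
rows = [
    {"full_name": "test", "age": 2,
     "address": [{"test": "hi", "second": "hi2"}]},
    {"full_name": "other", "age": 3,
     "address": [{"test": "a", "second": "b"}]},
]

# Newline-delimited JSON: one complete JSON object per line,
# with no surrounding list brackets and no outer quotation marks.
with open("testjson.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

Each line of the resulting file is a standalone JSON object, which is the format the NEWLINE_DELIMITED_JSON source format expects.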


2 Answers


I have been reproducing your scenario using your same code and the JSON content you shared, and I suspect the issue is simply that you are wrapping the JSON content in quotation marks (" or '), while it should not be quoted at all.

The correct format is the one that @ElliottBrossard already shared with you in his answer:

{'full_name':'test', 'age':2, 'address': [{'test':'hi', 'second':'hi2'}]}

If I run your code using that content in the testjson.json file, I get the response Loaded 1 rows into MY_DATASET:MY_TABLE and the content is loaded into the table. If, instead, I use the format below (which, according to your question and the comments on the other answer, is what you are using), I get google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details.

"{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}"

Additionally, you can go to the Jobs page in the BigQuery UI (https://bigquery.cloud.google.com/jobs/YOUR_PROJECT_ID), where you will find more information about the failed load job. As an example, when I run your code with the wrong JSON format, this is what I get:

[Screenshot of the BigQuery Jobs page showing the failed load job and its error details]

As you can see, the error message here is more relevant:

error message: JSON parsing error in row starting at position 0: Value encountered without start of object

It indicates that the parser could not find a valid start of a JSON object (i.e. an opening brace { at the very beginning of the line).
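A quick way to see what the parser means, using Python's standard json module (a small illustration; it uses double quotes, since strict JSON parsers reject single-quoted keys): a line wrapped in quotation marks parses as a JSON string value, not as an object.

```python
import json

obj_line = '{"full_name": "test", "age": 2}'
quoted_line = json.dumps(obj_line)  # wraps the same object in quotation marks

# The bare line parses as an object (dict); the quoted line is just a string,
# which is why the parser reports no "start of object".
print(type(json.loads(obj_line)).__name__)     # dict
print(type(json.loads(quoted_line)).__name__)  # str
```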

TL;DR: remove the quotation marks around your JSON object and the load job should work.


I think your JSON content should be:

{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}

(No outer square brackets and no surrounding quotation marks.) As an example using the command-line client:

$ echo "{'full_name':'test','age':2,'address':[{'test':'hi','second':'hi2'}]}" \
    > example.json
$ bq query --use_legacy_sql=false \
    "CREATE TABLE tmp_elliottb.JsonExample (full_name STRING NOT NULL, age INT64 NOT NULL, address ARRAY<STRUCT<test STRING, second STRING>>);"
$ bq load --source_format=NEWLINE_DELIMITED_JSON \
    tmp_elliottb.JsonExample example.json
$ bq head tmp_elliottb.JsonExample
+-----------+-----+--------------------------------+
| full_name | age |            address             |
+-----------+-----+--------------------------------+
| test      |   2 | [{"test":"hi","second":"hi2"}] |
+-----------+-----+--------------------------------+
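Before running bq load, each line of the file can also be sanity-checked locally. A sketch using python3 -m json.tool (it assumes double-quoted JSON, which strict parsers require, even though the bq load run above accepts the single-quoted form):

```shell
# Write one object per line, with no outer brackets or quotation marks.
printf '%s\n' '{"full_name":"test","age":2,"address":[{"test":"hi","second":"hi2"}]}' > example.json

# NDJSON requires each line to parse as a complete, standalone JSON object.
while IFS= read -r line; do
  printf '%s' "$line" | python3 -m json.tool > /dev/null || echo "bad line: $line"
done < example.json
echo "all lines checked"
```

A line that fails to parse is reported before the load job ever runs, which is easier to debug than the error stream.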