2
votes

I have a JSON data file that looks something like this:

{
    "key_a": "value_a",
    "key_b": "value_b",
    "key_c": {
        "c_nested/invalid.key.according.to.bigquery": "valid_value_though"
    }
}

As we know, BigQuery considers c_nested/invalid.key.according.to.bigquery an invalid column name. I have a huge amount of log data exported by StackDriver into Google Cloud Storage, and it contains a lot of such invalid field names (according to BigQuery, fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long).

As a workaround, I am trying to store the value of key_c (the whole {"c_nested/invalid.key.according.to.bigquery": "valid_value_though"} object) as a string in the BigQuery table.

I presume my table definition would look something like below:

[
    {
        "mode": "NULLABLE", 
        "name": "key_a", 
        "type": "STRING"
    },
    {
        "mode": "NULLABLE", 
        "name": "key_b", 
        "type": "STRING"
    },
    {
        "mode": "NULLABLE", 
        "name": "key_c", 
        "type": "STRING"
    }
]

When I try to create a table with this schema I get the below error:

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.

Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0

Error while reading data, error message: JSON parsing error in row starting at position 0: Expected key

Assuming that skipping a column is now supported in BigQuery, I then thought of simply leaving key_c out of the schema:

[
    {
        "mode": "NULLABLE", 
        "name": "key_a", 
        "type": "STRING"
    },
    {
        "mode": "NULLABLE", 
        "name": "key_b", 
        "type": "STRING"
    }
]

The above schema at least lets me create a permanent table (for querying external data), but when I try to query the data I get the following error:

Error while reading table: 
projectname.dataset_name.table_name, error message: 
JSON parsing error in row starting at position 0: No such field: key_c.

I understand there is a way, described here, to load each JSON row raw into BigQuery - as if it were a CSV - and then parse it in BigQuery, but that makes the queries too complicated.
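
For example, with every row loaded into a single raw STRING column, even a simple filter ends up looking roughly like the sketch below (raw_json and dataset_name.raw_logs are placeholder names, not my actual schema):

-- every field reference needs a JSON function on the raw string column
select
  JSON_EXTRACT_SCALAR(raw_json, '$.key_a') as key_a
from dataset_name.raw_logs
where JSON_EXTRACT_SCALAR(raw_json, '$.key_b') = 'value_b'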

Is cleaning the data the only way? How can I tackle this?

I am looking for a way to avoid creating a column for the invalid fields and instead store them directly as a STRING, or simply ignore them entirely. Is this possible?


1 Answer

2
votes

One of the main premises of BQ (and other cloud databases) is that storage is cheap. In practice, it is often helpful to load 'raw' or 'source' data into BQ and then transform it as needed (with views or other transformation tools). This is a paradigm shift from ETL to ELT.

With that in mind, I would import your "invalid" JSON blob as a string, and then parse it in your transformation steps. Here is one method:

with data as (
  -- the raw JSON row stored as a single string
  select '{"key_a":"value_a","key_b":"value_b","key_c":{"c_nested/invalid.key.according.to.bigquery":"valid_value_though"}}' as my_string
)
select
  JSON_EXTRACT_SCALAR(my_string,'$.key_a') as key_a,
  JSON_EXTRACT_SCALAR(my_string,'$.key_b') as key_b,
  -- rename the invalid key on the fly so it can be addressed with a normal JSONPath
  JSON_EXTRACT_SCALAR(REPLACE(my_string,"c_nested/invalid.key.according.to.bigquery","custom_key"),'$.key_c.custom_key') as key_c
from data
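
If you want to avoid repeating that parsing logic in every query, one option (in the spirit of the views mentioned above) is to wrap it in a view. This is just a sketch: dataset_name.raw_logs, parsed_logs, and the raw_json column are placeholder names for wherever your raw rows actually land.

-- expose clean columns; downstream queries never touch the raw JSON directly
create or replace view dataset_name.parsed_logs as
select
  JSON_EXTRACT_SCALAR(raw_json, '$.key_a') as key_a,
  JSON_EXTRACT_SCALAR(raw_json, '$.key_b') as key_b,
  -- key_c stays a JSON string here; parse it further as needed
  JSON_EXTRACT(raw_json, '$.key_c') as key_c_json
from dataset_name.raw_logs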