Flatten BigQuery String which resembles an array

Question

I'm working in the BigQuery Standard SQL dialect.

I have a column that I know to be an array of JSON dictionaries.

Array length is variable from row to row.

I'd like to flatten this so I can access the JSON elements for each dictionary in the arrays.

For example, let's say I have two records. The first has an id of 1, and in the JSON column, has this

[
    {"key1":"val1a", "key2": "val1b"},
    {"key1":"val1c", "key2": "val1d"}
]

The second has an id of 2, and in the JSON column has

[{"key1":"val2a", "key2":"val2b"}]

My goal is

id | key1  | key2  | offset
---------------------------
1  | val1a | val1b |   1
1  | val1c | val1d |   2
2  | val2a | val2b |   1

(Though I could live without the offset column)

It seems like something like this could work...

WITH table AS (
SELECT 1 as id,['{"key1":"val1a", "key2": "val1b"}','{"key1":"val1c", "key2": "val1d"}'] as array_column
UNION ALL
SELECT 2 as id,['{"key1":"val2a", "key2":"val2b"}'] as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t 
CROSS JOIN UNNEST(t.array_column) AS flattened_array

And in fact, that query returns the table I expect (minus the offset column, which is trivial to add)

The problem is that BigQuery doesn't understand that this thing is an array of JSON-like strings. It thinks the whole thing is one big string, and I don't know how to convince it otherwise. Editing my example to simulate this type confusion demonstrates the problem:

WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t 
CROSS JOIN UNNEST(t.array_column) AS flattened_array

Here, the validator complains because Values referenced in UNNEST must be arrays. UNNEST contains expression of type STRING at [29:23]

And now we're at the heart of the issue. Is there some obvious way to get BigQuery to understand that this string is a valid array of JSON dictionaries? Maybe some JSON_* function I've overlooked that will flatten the array? Or some way to CAST this thing to an array?

If the array has the same structure all the time you should import the json data as an array of structs - transforming it into a format bigquery understands. — Martin Weitzmann

Felipe Hoffa Felipe Hoffa · Accepted Answer · 2019-05-16T00:30:59

You can use BigQuery JavaScript UDFs to parse JSON in any way you'd like:

CREATE TEMP FUNCTION flatten_array(array_column STRING)
RETURNS ARRAY<STRUCT<key1 STRING, key2 STRING>>
LANGUAGE js
AS """
  return JSON.parse(array_column)
""";

WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)

SELECT id,
    key1,
    key2
FROM table t 
CROSS JOIN UNNEST(flatten_array(array_column)) AS flattened_array

To get better native BQ JSON arrays support, vote issue 63716683 up, and subscribe for updates.

Flatten BigQuery String which resembles an array

2 Answers