2
votes

I'm working in the BigQuery Standard SQL dialect.

I have a column that I know to be an array of JSON dictionaries.

Array length is variable from row to row.

I'd like to flatten this so I can access the JSON elements for each dictionary in the arrays.

For example, let's say I have two records. The first has an id of 1, and in the JSON column, has this

[
    {"key1":"val1a", "key2": "val1b"},
    {"key1":"val1c", "key2": "val1d"}
]

The second has an id of 2, and in the JSON column has

[{"key1":"val2a", "key2":"val2b"}]

My goal is

id | key1  | key2  | offset
---------------------------
1  | val1a | val1b |   1
1  | val1c | val1d |   2
2  | val2a | val2b |   1

(Though I could live without the offset column)

It seems like something like this could work...

WITH table AS (
SELECT 1 as id,['{"key1":"val1a", "key2": "val1b"}','{"key1":"val1c", "key2": "val1d"}'] as array_column
UNION ALL
SELECT 2 as id,['{"key1":"val2a", "key2":"val2b"}'] as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t 
CROSS JOIN UNNEST(t.array_column) AS flattened_array

And in fact, that query returns the table I expect (minus the offset column, which is trivial to add)

The problem is that BigQuery doesn't understand that this thing is an array of JSON-like strings. It thinks the whole thing is one big string, and I don't know how to convince it otherwise. Editing my example to simulate this type confusion demonstrates the problem:

WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t 
CROSS JOIN UNNEST(t.array_column) AS flattened_array

Here, the validator complains because Values referenced in UNNEST must be arrays. UNNEST contains expression of type STRING at [29:23]

And now we're at the heart of the issue. Is there some obvious way to get BigQuery to understand that this string is a valid array of JSON dictionaries? Maybe some JSON_* function I've overlooked that will flatten the array? Or some way to CAST this thing to an array?

2
If the array has the same structure all the time you should import the json data as an array of structs - transforming it into a format bigquery understands. - Martin Weitzmann

2 Answers

2
votes

You can use BigQuery JavaScript UDFs to parse JSON in any way you'd like:

CREATE TEMP FUNCTION flatten_array(array_column STRING)
RETURNS ARRAY<STRUCT<key1 STRING, key2 STRING>>
LANGUAGE js
AS """
  return JSON.parse(array_column)
""";

WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)

SELECT id,
    key1,
    key2
FROM table t 
CROSS JOIN UNNEST(flatten_array(array_column)) AS flattened_array

enter image description here

To get better native BQ JSON arrays support, vote issue 63716683 up, and subscribe for updates.

2
votes

Below is for BigQuery Standard SQL and I recommend using it in case if your json is as simple as in your example

#standardSQL
SELECT id, key1, key2
FROM table,
UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key1"\s*:\s*"(.*?)"')) key1 WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key2"\s*:\s*"(.*?)"')) key2 WITH OFFSET
USING (OFFSET)

you can test, play with above using sample data from your question as in below example

#standardSQL
WITH table AS (
  SELECT 1 AS id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' AS array_column UNION ALL
  SELECT 2 AS id,'[{"key1":"val2a", "key2":"val2b"}]' AS array_column
)
SELECT id, key1, key2
FROM table,
UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key1"\s*:\s*"(.*?)"')) key1 WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key2"\s*:\s*"(.*?)"')) key2 WITH OFFSET
USING (OFFSET)

with result

Row id  key1    key2     
1   1   val1a   val1b    
2   1   val1c   val1d    
3   2   val2a   val2b   

Not 100% sure, but I feel above is less expensive than UDF - which is still great option :o) especially for more generic cases with more complex json