
I'm trying to load JSON data into BigQuery. The excerpt of my data causing problems looks like this:

{"page":"295"}
{"page":["295", "123"]}

I have defined schema for this field to be:

{
  "name": "page",
  "type": "string",
  "mode": "repeated"
}

I'm getting the error "Repeated field must be imported as a JSON array." I think the problem is caused by the structure of the JSON itself: the string in the first row should be a one-element array instead of a bare string. The data was delivered to me like this. I am trying to find a way to force BigQuery to read this string as a one-element array instead of fixing the file (which is hundreds of gigabytes). Is there any solution?
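For reference, parsing the two sample lines with Python's json module shows the type mismatch directly (a minimal illustration using the excerpt above):

```python
import json

lines = ['{"page":"295"}', '{"page":["295", "123"]}']

for line in lines:
    value = json.loads(line)["page"]
    # The first row yields a str, the second a list; a REPEATED
    # field requires every row to supply a JSON array.
    print(type(value).__name__, value)
```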


1 Answer


One option is to use BigQuery to transform the data itself. Supposing you can import the rows as CSV instead (pick a field separator that doesn't appear in your data, so each JSON line lands in a single string column), you can use the JSON_EXTRACT function to retrieve the value(s) of page from each row. For example:

#standardSQL
SELECT JSON_EXTRACT(json, '$.page') AS page
FROM UnprocessedTable;

You can use the SPLIT function or REGEXP_EXTRACT_ALL to retrieve the individual values afterward.

Edit: As a concrete example, you can try this query:

#standardSQL
WITH T AS (
  SELECT '{"page": "foo"}' AS json UNION ALL
  SELECT '{"page": ["bar", "baz"]}' AS json UNION ALL
  SELECT '{"page": ["a", "b", "c"]}' AS json
)
SELECT
  REGEXP_EXTRACT_ALL(JSON_EXTRACT(json, '$.page'), r'"([^"]*)"') AS pages
FROM T;

This returns the JSON arrays (or scalar strings) as an ARRAY<STRING> column.
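If it helps to sanity-check the extraction logic outside BigQuery, the same pattern can be sketched in Python (the regular expression is the one from the query above; the helper function name is just for illustration):

```python
import json
import re

# Pattern from the query: pull every double-quoted token out of
# whatever JSON_EXTRACT returns ('"foo"' or '["bar", "baz"]').
PATTERN = re.compile(r'"([^"]*)"')

def pages(json_line):
    # json.dumps re-serializes the field roughly the way JSON_EXTRACT
    # would return it: a quoted scalar or a JSON array literal.
    extracted = json.dumps(json.loads(json_line)["page"])
    return PATTERN.findall(extracted)

print(pages('{"page": "foo"}'))           # ['foo']
print(pages('{"page": ["bar", "baz"]}'))  # ['bar', 'baz']
```

Both the scalar and the array rows come out as a plain list of strings, matching the ARRAY&lt;STRING&gt; result of the query.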