I'm trying to load JSON event data from Google Cloud Storage into BigQuery. I want to load every field as a string and cast downstream, because the same field shows up with different types across messages. For example, some messages look like this:
{"id":"123"}
{"id":234}
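To be concrete about the "cast later" part, the plan is something along these lines once the load works (rough example only; SAFE_CAST returns NULL for values that don't parse):

bq query --use_legacy_sql=false \
'SELECT SAFE_CAST(id AS INT64) AS id FROM nov2020.test'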
The schema.json I wrote is:
[
  {
    "name": "id",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
and I load it with the bq CLI:
LOC="--location US"
INPUT=sample.json
SCHEMA=schema.json
bq $LOC load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  --schema=$SCHEMA \
  nov2020.test \
  $INPUT
It fails with this error:
Failure details:
- Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 2; errors: 1; max bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row starting at position 13: Could not convert value to string. Field: id; Value: 234
I don't want to skip these records with --max_bad_records, and I thought that by skipping autodetect and supplying an explicit STRING schema, everything would be read in as a string.
I have about 80 GB of these JSON files to process every day, so what can I do about this error? Is my only option to preprocess every JSON message and normalize the id field to a string before loading into BigQuery (rough sketch of what I mean below)?
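If preprocessing really is the answer, this is roughly what I have in mind, assuming something like jq in front of the load (just a sketch, and it means rewriting all 80 GB every day):

# Force id to a string on every record before loading
# (note: a record with a null id would end up with the string "null").
jq -c '.id |= tostring' sample.json > sample_fixed.json
# Then load sample_fixed.json with the same bq load command as above.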
I also found another post suggesting a Fluentd plugin, but I'm not sure that applies here since my data wasn't produced by Fluentd.
Thanks!