I am writing a complex JSON object collection to BigQuery table from Dataflow using Python. To manually create the table schema like following is way too complicated since my json object is nested with multiple layers.
from apache_beam.io.gcp.internal.clients import bigquery
table_schema=bigquery.TableSchema()
id_schema = bigquery.TableFieldSchema()
id_schema.name = 'ID'
id_schema.type = 'integer'
id_schema.mode = 'nullable'
table_schema.fields.append(id_schema)
...
So I tried the method recommended by Writing nested schema to BigQuery from Dataflow (Python). First I run following command in cloud console to get the schema
bq --format=json show project:dataset.table > output_schema.json
Then I run following code the get the table schema,
table_schema = parse_table_schema_from_json(json.dumps(json.load(open("output_schema.json"))["schema"]))
This works perfectly as expected. And the table was originally created from a Jupyter notebook, where I could use bigquery.LoadJobConfig with autodetect to write to BigQuery without providing schema.
Now I use Apache Beam pipeline try to write to BigQuery with this schema, somehow I got a few errors like:
WARNING:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery. Will retry. Errors were [<InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: ''
location: 'sectiontokens.documents'
message: 'Array specified for non-repeated field.'
reason: 'invalid'>]
index: 0>, <InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: ''
location: 'sectiontokens.errors'
message: 'Array specified for non-repeated field.'
reason: 'invalid'>]
index: 1>, <InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: ''
location: 'sectiontokens.documents'
message: 'Array specified for non-repeated field.'
reason: 'invalid'>]
index: 2>]
my table schema is:
table_schema = {
"fields": [
{"name": "ID", "type": "INTEGER", "mode": "NULLABLE"},
{"name": "SourceResourceID","type": "STRING","mode": "NULLABLE"},
{"name": "DocumentText","type": "STRING","mode": "NULLABLE"},
{"name": "DocumentName","type": "STRING","mode": "NULLABLE"},
{"name": "EncounterNumber","type": "FLOAT","mode": "NULLABLE"},
{"name": "EncounterResourceID","type": "STRING","mode": "NULLABLE"},
{"name": "DocumentId","type": "STRING","mode": "NULLABLE"},
{"name": "DocumentDate","type": "TIMESTAMP","mode": "NULLABLE"},
{"name": "SectionTitle","type": "STRING","mode": "NULLABLE"},
{"name": "SectionHeader","type": "STRING","mode": "NULLABLE"},
{"name": "SectionText","type": "STRING","mode": "NULLABLE"},
{"name": "SectionTokens","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "documents","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "list","type": "RECORD","mode": "REPEATED",
"fields": [
{"name": "item","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "entities","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "list","type": "RECORD","mode": "REPEATED",
"fields": [
{"name": "item","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "category","type": "STRING","mode": "NULLABLE"},
{"name": "confidenceScore","type": "FLOAT","mode": "NULLABLE"},
{"name": "id","type": "STRING","mode": "NULLABLE"},
{"name": "isNegated","type": "BOOLEAN","mode": "NULLABLE"},
{"name": "length","type": "INTEGER","mode": "NULLABLE"},
{"name": "links","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "list","type": "RECORD","mode": "REPEATED",
"fields": [
{"name": "item","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "dataSource","type": "STRING","mode":"NULLABLE"},
{"name": "id","type": "STRING","mode": "NULLABLE"}
]
}
]
}
]
},
{"name": "offset","type": "INTEGER","mode": "NULLABLE"},
{"name": "text","type": "STRING","mode": "NULLABLE"}
]
}
]
}
]
},
{"name": "id","type": "STRING","mode": "NULLABLE"},
{"name": "relations","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "list","type": "RECORD","mode": "REPEATED",
"fields": [
{"name": "item","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "bidirectional","type": "BOOLEAN","mode": "NULLABLE"},
{"name": "relationType","type": "STRING","mode": "NULLABLE"},
{"name": "source","type": "STRING","mode": "NULLABLE"},
{"name": "target","type": "STRING","mode": "NULLABLE"}
]
}
]
}
]
}
]
}
]
}
]
},
{"name": "errors","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "list","type": "RECORD","mode": "REPEATED",
"fields": [
{"name": "item","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "error","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "code","type": "STRING","mode": "NULLABLE"},
{"name": "innererror","type": "RECORD","mode": "NULLABLE",
"fields": [
{"name": "code","type": "STRING","mode": "NULLABLE"},
{"name": "message","type": "STRING","mode": "NULLABLE"}
]
},
{"name": "message","type": "STRING","mode": "NULLABLE"}
]
},
{"name": "id","type": "STRING","mode": "NULLABLE"}
]
}
]
}
]
},
{"name": "modelVersion","type": "STRING","mode": "NULLABLE"}
]
}
]
}
And here is some sample data:
{'ID': 123, 'SourceResourceID': 'Resource/3c81b4d2-3ee9-11eb-8bf6-0242ac100303', 'DocumentText': 'EXAM: CT CHEST IC \n\n\nPROCEDURE DATE: 12/11/2020 \n', 'DocumentName': 'CT CHEST IC', 'EncounterNumber': None, 'EncounterResourceID': 'Encounter/123', 'DocumentId': '123', 'DocumentDate': '2020-12-15 10:21:00 UTC', 'SectionTitle': 'physical_exam', 'SectionHeader': 'EXAM:', 'SectionText': 'EXAM: CT CHEST IC \n\n\nPROCEDURE DATE: 12/11/2020 \n \n\n\n', 'SectionTokens': {'documents': [{'id': '1', 'entities': [{'id': '0', 'offset': 7, 'length': 11, 'text': 'CT CHEST IC', 'category': 'ExaminationName', 'confidenceScore': 0.98, 'isNegated': False}]}], 'errors': [], 'modelVersion': '2020-09-03'}}
Can anyone help out to find out what I did wrong? Thanks.