0
votes

I am writing a complex JSON object collection to BigQuery table from Dataflow using Python. To manually create the table schema like following is way too complicated since my json object is nested with multiple layers.

from apache_beam.io.gcp.internal.clients import bigquery

table_schema=bigquery.TableSchema()

id_schema = bigquery.TableFieldSchema()
id_schema.name = 'ID'
id_schema.type = 'integer'
id_schema.mode = 'nullable'
table_schema.fields.append(id_schema)
...

So I tried the method recommended by Writing nested schema to BigQuery from Dataflow (Python). First I run following command in cloud console to get the schema

bq --format=json show project:dataset.table > output_schema.json

Then I run following code the get the table schema,

table_schema = parse_table_schema_from_json(json.dumps(json.load(open("output_schema.json"))["schema"]))

This works perfectly as expected. And the table was originally created from a Jupyter notebook, where I could use bigquery.LoadJobConfig with autodetect to write to BigQuery without providing schema.

Now I use Apache Beam pipeline try to write to BigQuery with this schema, somehow I got a few errors like:

WARNING:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery. Will retry. Errors were [<InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.documents'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 0>, <InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.errors'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 1>, <InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.documents'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 2>]

my table schema is:

table_schema = {
  "fields": [
    {"name": "ID", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "SourceResourceID","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentText","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentName","type": "STRING","mode": "NULLABLE"},
    {"name": "EncounterNumber","type": "FLOAT","mode": "NULLABLE"},
    {"name": "EncounterResourceID","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentId","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentDate","type": "TIMESTAMP","mode": "NULLABLE"},
    {"name": "SectionTitle","type": "STRING","mode": "NULLABLE"},
    {"name": "SectionHeader","type": "STRING","mode": "NULLABLE"},
    {"name": "SectionText","type": "STRING","mode": "NULLABLE"},
    {"name": "SectionTokens","type": "RECORD","mode": "NULLABLE",
      "fields": [
        {"name": "documents","type": "RECORD","mode": "NULLABLE",
          "fields": [
            {"name": "list","type": "RECORD","mode": "REPEATED",
              "fields": [
                {"name": "item","type": "RECORD","mode": "NULLABLE",
                  "fields": [
                    {"name": "entities","type": "RECORD","mode": "NULLABLE",
                      "fields": [
                        {"name": "list","type": "RECORD","mode": "REPEATED",
                          "fields": [
                            {"name": "item","type": "RECORD","mode": "NULLABLE",
                              "fields": [
                                {"name": "category","type": "STRING","mode": "NULLABLE"},
                                {"name": "confidenceScore","type": "FLOAT","mode": "NULLABLE"},
                                {"name": "id","type": "STRING","mode": "NULLABLE"},
                                {"name": "isNegated","type": "BOOLEAN","mode": "NULLABLE"},
                                {"name": "length","type": "INTEGER","mode": "NULLABLE"},
                                {"name": "links","type": "RECORD","mode": "NULLABLE",
                                  "fields": [
                                    {"name": "list","type": "RECORD","mode": "REPEATED",
                                      "fields": [
                                        {"name": "item","type": "RECORD","mode": "NULLABLE",
                                          "fields": [
                                            {"name": "dataSource","type": "STRING","mode":"NULLABLE"},
                                            {"name": "id","type": "STRING","mode": "NULLABLE"}
                                          ]
                                        }
                                      ]
                                    }
                                  ]
                                },
                                {"name": "offset","type": "INTEGER","mode": "NULLABLE"},
                                {"name": "text","type": "STRING","mode": "NULLABLE"}
                              ]
                            }
                          ]
                        }
                      ]
                    },
                    {"name": "id","type": "STRING","mode": "NULLABLE"},
                    {"name": "relations","type": "RECORD","mode": "NULLABLE",
                      "fields": [
                        {"name": "list","type": "RECORD","mode": "REPEATED",
                          "fields": [
                            {"name": "item","type": "RECORD","mode": "NULLABLE",
                              "fields": [
                                {"name": "bidirectional","type": "BOOLEAN","mode": "NULLABLE"},
                                {"name": "relationType","type": "STRING","mode": "NULLABLE"},
                                {"name": "source","type": "STRING","mode": "NULLABLE"},
                                {"name": "target","type": "STRING","mode": "NULLABLE"}
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        {"name": "errors","type": "RECORD","mode": "NULLABLE",
          "fields": [
            {"name": "list","type": "RECORD","mode": "REPEATED",
              "fields": [
                {"name": "item","type": "RECORD","mode": "NULLABLE",
                  "fields": [
                    {"name": "error","type": "RECORD","mode": "NULLABLE",
                      "fields": [
                        {"name": "code","type": "STRING","mode": "NULLABLE"},
                        {"name": "innererror","type": "RECORD","mode": "NULLABLE",
                          "fields": [
                            {"name": "code","type": "STRING","mode": "NULLABLE"},
                            {"name": "message","type": "STRING","mode": "NULLABLE"}
                          ]
                        },
                        {"name": "message","type": "STRING","mode": "NULLABLE"}
                      ]
                    },
                    {"name": "id","type": "STRING","mode": "NULLABLE"}
                  ]
                }
              ]
            }
          ]
        },
        {"name": "modelVersion","type": "STRING","mode": "NULLABLE"}
      ]
    }
  ]
}

And here is some sample data:

{'ID': 123, 'SourceResourceID': 'Resource/3c81b4d2-3ee9-11eb-8bf6-0242ac100303', 'DocumentText': 'EXAM:  CT CHEST IC  \n\n\nPROCEDURE DATE:  12/11/2020  \n', 'DocumentName': 'CT CHEST IC', 'EncounterNumber': None, 'EncounterResourceID': 'Encounter/123', 'DocumentId': '123', 'DocumentDate': '2020-12-15 10:21:00 UTC', 'SectionTitle': 'physical_exam', 'SectionHeader': 'EXAM:', 'SectionText': 'EXAM:  CT CHEST IC  \n\n\nPROCEDURE DATE:  12/11/2020  \n \n\n\n', 'SectionTokens': {'documents': [{'id': '1', 'entities': [{'id': '0', 'offset': 7, 'length': 11, 'text': 'CT CHEST IC', 'category': 'ExaminationName', 'confidenceScore': 0.98, 'isNegated': False}]}], 'errors': [], 'modelVersion': '2020-09-03'}}

Can anyone help out to find out what I did wrong? Thanks.

1

1 Answers

2
votes

In your schema, the sectiontokens.documents and sectiontokens.errors are specified as type RECORD, this means that BigQuery expects to have for that field only one record, but in your data, those keys are actually a list of objects.

If you want to define a column to accept a list of objects, you need to have "mode":"REPEATED" https://cloud.google.com/bigquery/docs/nested-repeated