0 votes

I am using the Dataflow template "Pub/Sub Topic to BigQuery" to parse JSON messages that contain a RECORD-type data structure. Sample message:

{
"url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
"body":
    {
    "session_duration":"61",
    "app_key":"eyrttyuyyu78jkjk",
    "device_id":"h1bh41yptik1vtwr8",
    "sdk_name":"javascript_native_web",
    "sdk_version":"18.04",
    "timestamp":"1597057884636",
    "hour":"10",
    "dow":"1"
    },
"app_key":"eyrttyuyyu78jkjk",
"timestamp":"1597057884636",
"ip_address":"0.0.0.0"
}
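For reference, this is a minimal sketch of how such a message could be published to the topic the template reads from, using the Python Pub/Sub client. The project and topic names are placeholders, not taken from the original setup:

import json
from google.cloud import pubsub_v1

# Placeholder project and topic names -- substitute your own.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

message = {
    "url": "/i?session_duration=61&app_key=123456&device_id=gdfttyty"
           "&sdk_name=javascript_native_web&sdk_version=18.04",
    "body": {
        "session_duration": "61",
        "app_key": "eyrttyuyyu78jkjk",
        "device_id": "h1bh41yptik1vtwr8",
        "sdk_name": "javascript_native_web",
        "sdk_version": "18.04",
        "timestamp": "1597057884636",
        "hour": "10",
        "dow": "1",
    },
    "app_key": "eyrttyuyyu78jkjk",
    "timestamp": "1597057884636",
    "ip_address": "0.0.0.0",
}

# The template expects each Pub/Sub message to carry one JSON object.
future = publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
print(future.result())  # message ID once the publish succeeds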

The schema defined in BigQuery is:

[

   {
      "name":"url",
      "type":"STRING",
      "mode":"NULLABLE"
   },
   {
      "name":"body",
      "type":"RECORD",
      "mode":"REPEATED",
      "fields":[
         {
            "name":"session_duration",
            "type":"STRING",
            "mode":"NULLABLE"
         },
         {
            "name":"app_key",
            "type":"STRING",
            "mode":"NULLABLE"
         },
         {
            "name":"device_id",
            "type":"STRING",
            "mode":"NULLABLE"
         }, 
         {
            "name":"sdk_name",
            "type":"STRING",
            "mode":"NULLABLE"
         },
         {
            "name":"sdk_version",
            "type":"STRING",
            "mode":"NULLABLE"
         }, 
         {
            "name":"timestamp",
            "type":"TIMESTAMP",
            "mode":"NULLABLE"
         }, 
         {
            "name":"hour",
            "type":"TIME",
            "mode":"NULLABLE"
         },      
         {
            "name":"dow",
            "type":"STRING",
            "mode":"NULLABLE"
         }

      ]
   },
   {
      "name":"app_key",
      "type":"STRING",
      "mode":"NULLABLE"
   },
   {
      "name":"timestamp",
      "type":"STRING",
      "mode":"NULLABLE"
   },
   {
      "name":"ip_address",
      "type":"STRING",
      "mode":"NULLABLE"
   }
]
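If it helps to see the same schema expressed programmatically, this is a minimal sketch using the Python BigQuery client; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
table_ref = "my-project.my_dataset.events"  # placeholder table name

schema = [
    bigquery.SchemaField("url", "STRING", mode="NULLABLE"),
    bigquery.SchemaField(
        "body", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("session_duration", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("app_key", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("device_id", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("sdk_name", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("sdk_version", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("timestamp", "TIMESTAMP", mode="NULLABLE"),
            bigquery.SchemaField("hour", "TIME", mode="NULLABLE"),
            bigquery.SchemaField("dow", "STRING", mode="NULLABLE"),
        ],
    ),
    bigquery.SchemaField("app_key", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("timestamp", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("ip_address", "STRING", mode="NULLABLE"),
]

# Create the table if it does not already exist.
client.create_table(bigquery.Table(table_ref, schema=schema), exists_ok=True)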

Error Message:

{"errors":[{"debugInfo":"","location":"","message":"Repeated record added outside of an array.","reason":"invalid"}],"index":0}

If I parse the data without the RECORD type, it is parsed correctly and lands in the appropriate BigQuery table, but with the RECORD type it is routed to the BigQuery-generated <error_records> table.
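The error can be reproduced outside the template with a direct streaming insert. A minimal sketch using the Python BigQuery client, with a placeholder table name and only the STRING subfields included to isolate the array issue:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # placeholder name

row = {
    "url": "/i?session_duration=61",
    # "body" is a single JSON object here, but the column is REPEATED,
    # so the insert should be rejected with an "invalid" error similar
    # to the one above.
    "body": {"session_duration": "61", "app_key": "eyrttyuyyu78jkjk"},
    "app_key": "eyrttyuyyu78jkjk",
    "timestamp": "1597057884636",
    "ip_address": "0.0.0.0",
}

print(client.insert_rows_json(table_id, [row]))  # non-empty error list

# Wrapping the record in a list matches the REPEATED mode.
row["body"] = [row["body"]]
print(client.insert_rows_json(table_id, [row]))  # [] on success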


1 Answer

2 votes

I managed to insert your sample into BigQuery successfully using the Dataflow Pub/Sub to BigQuery template after applying some modifications:

  • I included the repeated body field in an array by putting it inside square brackets [...] (see the sketch after this list).
  • The body.timestamp value is invalid. You can read here about the difference between the BigQuery TIMESTAMP data type and a UNIX timestamp. You have some options for handling this, depending on what you want to do with the timestamp. If you don't need it for analysis, you can simply change the field's data type to INT64 or STRING, as you have done with the table's top-level timestamp column.
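The template passes the JSON payload through as-is, so the reshaping has to happen before the message is published (or in the template's optional JavaScript UDF). A minimal producer-side sketch in Python, assuming you control the publisher:

import json

def wrap_body(raw_message: str) -> str:
    """Wrap the 'body' object in a list so it matches the REPEATED column."""
    payload = json.loads(raw_message)
    body = payload.get("body")
    if isinstance(body, dict):
        payload["body"] = [body]
    return json.dumps(payload)

# Example: the original single-object "body" becomes a one-element array.
original = '{"url": "/i?session_duration=61", "body": {"session_duration": "61"}}'
print(wrap_body(original))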

So the message should be like this:

{
"url":"/i?session_duration=61&app_key=123456&device_id=gdfttyty&sdk_name=javascript_native_web&sdk_version=18.04",
"body": [
    {
    "session_duration":"61",
    "app_key":"eyrttyuyyu78jkjk",
    "device_id":"h1bh41yptik1vtwr8",
    "sdk_name":"javascript_native_web",
    "sdk_version":"18.04",
    "timestamp":"1597057884636",
    "hour":"10",
    "dow":"1"
    }],
"app_key":"eyrttyuyyu78jkjk",
"timestamp":"1597057884636",
"ip_address":"0.0.0.0"
}

and the schema, with the data type of the body.timestamp field changed, like this:

[

   ...
   
   ,
   {
      "name":"body",
      "type":"RECORD",
      "mode":"REPEATED",
      "fields":[
         
         ...

         , 
         {
            "name":"timestamp",
            "type":"STRING",
            "mode":"NULLABLE"
         }, 
         
         ...
         
      ]
   },
   
   ...
   
]
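If you would rather keep body.timestamp as a TIMESTAMP, one of those options is to convert the epoch-milliseconds value before publishing into a format that streaming inserts accept for TIMESTAMP columns. A minimal sketch:

from datetime import datetime, timezone

epoch_ms = "1597057884636"  # UNIX epoch in milliseconds, as in the sample

# ISO-8601 strings (or epoch seconds) are accepted for TIMESTAMP columns
# on streaming inserts.
ts = datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc)
print(ts.isoformat())  # 2020-08-10T11:11:24.636000+00:00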