
I am trying to write to BigQuery using Google Dataflow. However, some of the data is corrupt, so the value I am trying to write to a column of a BigQuery table does not match that column's datatype. The job logs show errors like the one below:

BigQuery job "dataflow_job_6154554826811450962-B" in project "analytics-repo-dev" finished with error(s): errorResult: JSON table encountered too many errors, giving up. Rows: 1951; errors: 1., error: JSON table encountered too many errors, giving up. Rows: 1951; errors: 1., error: JSON parsing error in row starting at position 32470424514 at file: gs://abhishek_bucket/staging_bucket/6154554826811451567/dax-tmp-2016-08-21_08_10_40-11073983896195810394-S122-1-e7a57fb7a3ac697f/-shard-00000-of-00001-try-2ac1f1192effdfbb-endshard.json. Could not parse '20-259228628. ' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]] Field: Ts; Value: 20-259228628.

The problem is that these are "errors" and not "exceptions": the moment Dataflow hits one of them, the entire job stops. I need a mechanism to "catch" them. Had these messages been "exceptions", I could have handled them separately and the job would not have failed completely.

So is there a way to get the actual exceptions that get raised when I try to write bad data to BigQuery in Google Dataflow?


1 Answer


This is a known issue with the streaming BigQueryIO sink; unfortunately there isn't a good workaround, other than doing validation in your own DoFn before the BigQueryIO.Write. Doing that lets you drop, fix, or route to another sink any records that are invalid.
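As a rough illustration, here is a minimal sketch of that pattern using the Dataflow Java SDK 1.x: a DoFn that checks the "Ts" field (the field named in your error message) against BigQuery's timestamp format and sends rows that would break the write to a side output. The class names ValidateRows/ValidateFn, the tag names, and the regex-based check are all assumptions for illustration; adapt the validation to your actual schema.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.PCollection;
    import com.google.cloud.dataflow.sdk.values.PCollectionTuple;
    import com.google.cloud.dataflow.sdk.values.TupleTag;
    import com.google.cloud.dataflow.sdk.values.TupleTagList;

    public class ValidateRows {
      // Tags for routing rows: valid rows go on to BigQueryIO.Write, invalid rows elsewhere.
      static final TupleTag<TableRow> VALID = new TupleTag<TableRow>() {};
      static final TupleTag<TableRow> INVALID = new TupleTag<TableRow>() {};

      // Matches BigQuery's required timestamp format: YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
      static final String TS_PATTERN =
          "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}(:\\d{2}(\\.\\d{1,6})?)?";

      static class ValidateFn extends DoFn<TableRow, TableRow> {
        @Override
        public void processElement(ProcessContext c) {
          TableRow row = c.element();
          Object ts = row.get("Ts");  // field name taken from the error message; an assumption
          if (ts != null && ts.toString().matches(TS_PATTERN)) {
            c.output(row);               // main output: rows safe to write to BigQuery
          } else {
            c.sideOutput(INVALID, row);  // side output: rows that would fail the load
          }
        }
      }

      public static PCollectionTuple validate(PCollection<TableRow> rows) {
        return rows.apply(ParDo.withOutputTags(VALID, TupleTagList.of(INVALID))
                               .of(new ValidateFn()));
      }
    }

You would then apply BigQueryIO.Write only to the VALID collection, and send the INVALID collection to whatever dead-letter destination suits you (for example, convert the rows to JSON strings and write them with TextIO, or write them to a separate "errors" BigQuery table with a string schema) so you can inspect and repair them later.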