
I am trying to write to BigQuery using Google Dataflow. However, some of the data is corrupt, so the value I am trying to write to a column of a BigQuery table does not match that column's datatype. The job logs show errors like the one below:

BigQuery job "dataflow_job_6154554826811450962-B" in project "analytics-repo-dev" finished with error(s): errorResult: JSON table encountered too many errors, giving up. Rows: 1951; errors: 1., error: JSON table encountered too many errors, giving up. Rows: 1951; errors: 1., error: JSON parsing error in row starting at position 32470424514 at file: gs://abhishek_bucket/staging_bucket/6154554826811451567/dax-tmp-2016-08-21_08_10_40-11073983896195810394-S122-1-e7a57fb7a3ac697f/-shard-00000-of-00001-try-2ac1f1192effdfbb-endshard.json. Could not parse '20-259228628. ' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]] Field: Ts; Value: 20-259228628.

The problem is that these are "errors" and not "exceptions": the moment Dataflow hits one of them, the entire job stops. I need a mechanism to "catch" them. Had these messages been "exceptions", I could have handled them separately and the job would not have failed completely.

So is there a way to get the actual exceptions that get raised when I try to write bad data to BigQuery in Google Dataflow?


1 Answer


This is a known issue with the streaming BigQueryIO sink; unfortunately there isn't a good workaround, other than doing validation in your own DoFn before the BigQueryIO.Write. Doing that lets you drop, fix, or route to another sink any records that are invalid.
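As a rough illustration, here is a minimal sketch of that pattern using the Dataflow Java SDK 1.x: a DoFn that checks the "Ts" field (the field named in your error message) against BigQuery's timestamp format and sends rows that would break the write to a side output. The class names ValidateRows/ValidateFn, the tag names, and the regex-based check are all assumptions for illustration; adapt the validation to your actual schema.

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.PCollection;
    import com.google.cloud.dataflow.sdk.values.PCollectionTuple;
    import com.google.cloud.dataflow.sdk.values.TupleTag;
    import com.google.cloud.dataflow.sdk.values.TupleTagList;

    public class ValidateRows {
      // Tags for routing rows: valid rows go on to BigQueryIO.Write, invalid rows elsewhere.
      static final TupleTag<TableRow> VALID = new TupleTag<TableRow>() {};
      static final TupleTag<TableRow> INVALID = new TupleTag<TableRow>() {};

      // Matches BigQuery's required timestamp format: YYYY-MM-DD HH:MM[:SS[.SSSSSS]]
      static final String TS_PATTERN =
          "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}(:\\d{2}(\\.\\d{1,6})?)?";

      static class ValidateFn extends DoFn<TableRow, TableRow> {
        @Override
        public void processElement(ProcessContext c) {
          TableRow row = c.element();
          Object ts = row.get("Ts");  // field name taken from the error message; an assumption
          if (ts != null && ts.toString().matches(TS_PATTERN)) {
            c.output(row);               // main output: rows safe to write to BigQuery
          } else {
            c.sideOutput(INVALID, row);  // side output: rows that would fail the load
          }
        }
      }

      public static PCollectionTuple validate(PCollection<TableRow> rows) {
        return rows.apply(ParDo.withOutputTags(VALID, TupleTagList.of(INVALID))
                               .of(new ValidateFn()));
      }
    }

You would then apply BigQueryIO.Write only to the VALID collection, and send the INVALID collection to whatever dead-letter destination suits you (for example, convert the rows to JSON strings and write them with TextIO, or write them to a separate "errors" BigQuery table with a string schema) so you can inspect and repair them later.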