
I had an issue where I was using Dataflow to parse a text file and then put relevant data into BQ. The issue was seemingly caused by a single line of malformed input in the text file. I was able to fix the error, but it got me thinking: if I had a line of input that was hosing Dataflow, is there any way I could find out the line? This would make one part of Dataflow debugging much easier, especially if your input file was a few billion lines and you had to track down the one line causing problems.

As an example, let's say I'm posting data that I think is an integer to BigQuery. I might create my schema like this:

    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("ItemNum").setType("INTEGER"));

And I might map the input data into the BigQuery schema with this function:

    public void processElement(ProcessContext c) {
        TableRow row = new TableRow();
        row.set("ItemNum", c.element()); // assumes the element is a valid integer
        c.output(row);
    }

But when Dataflow hits my malformed input (where it is not an integer), I get an error like this:

Workflow failed. Causes: (30d455a6f7aaaaaa): BigQuery job "dataflow_job_3518531384490999999" in project "project-name" finished with error(s): job error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range)., error: Could not convert value to integer (bad value or out of range).

In this particular case I should be verifying that my input is an integer before trying to put it into BigQuery (and then logging any data that fails validation). But the general question remains: suppose I want to see the input that caused this error, because (I think) I'm already performing all appropriate input validation and have no idea what sort of malformed data could slip through. How would I do that? I'm thinking some sort of try/catch trick (possibly involving a log message) could work, but I'm not sure how to set that up.
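For reference, here's roughly the kind of validation I have in mind, pulled out into a plain helper so it's easy to test (the class and method names are just for illustration). In the DoFn I'd call this from processElement and skip c.output() when it returns null:

```java
import java.util.logging.Logger;

public class InputValidation {
    private static final Logger LOG = Logger.getLogger(InputValidation.class.getName());

    // Returns the parsed integer, or null if the element is malformed.
    // Logging the raw element here is what lets you find the bad line later.
    public static Integer parseItemNum(String element) {
        try {
            return Integer.valueOf(element.trim());
        } catch (NumberFormatException e) {
            LOG.warning("Skipping malformed ItemNum: \"" + element + "\"");
            return null;
        }
    }
}
```

The warning would show up in the worker logs in the Dataflow console, so you could search for "Skipping malformed" to find the offending line.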

Thanks!

1
This may not be your problem, but I have a 5GB .csv data file where some records had unnecessary commas and extra quotes within certain fields (e.g. "Cleveland, OH" as opposed to "Cleveland OH"). These were causing BigQuery to fail to load the data set. I rewrote the .csv file locally with a python script to remove the extraneous punctuation, before loading the data into Google Cloud Storage and BQ. After doing that everything worked fine. - Paul
My issue was similar. I wasn't doing proper validation checks on the incoming data and had malformed input. I was doing some typecasting that blew up when faced with unexpected input. Never identified the exact error, just know that adding extra logic to the typecasting to handle unexpected input seemed to fix the problem. - Michael Sherman

1 Answer


The approach you suggest (using a try/catch and logging your parse errors separately) is a good way to go at the moment. We are actively studying options to better equip pipeline writers to handle these types of issues.
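To make the failures easy to inspect rather than just grep for in logs, the same try/catch can route bad elements to their own output (in Dataflow, a side output via a TupleTag) that you then write somewhere durable. Here is a minimal pure-Java sketch of that routing, with the SDK types elided and all names my own:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the try/catch "dead letter" pattern: elements that fail to
// parse go to their own collection instead of failing the pipeline.
// In a real DoFn, `parsed` would be the main output and `badLines` a
// side output (e.g. written to GCS or a separate BigQuery table).
public class DeadLetterSketch {
    public final List<Integer> parsed = new ArrayList<>();
    public final List<String> badLines = new ArrayList<>();

    public void process(String element) {
        try {
            parsed.add(Integer.parseInt(element.trim()));
        } catch (NumberFormatException e) {
            badLines.add(element); // the exact malformed input, preserved for inspection
        }
    }
}
```

After the run, the dead-letter output contains exactly the lines that would otherwise have failed the BigQuery load, which answers the "which line was it?" question directly.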