1
votes

My Dataflow pipeline works as follows:

Read from Pub/Sub
Transform the data into rows
Write the rows to BigQuery
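
Roughly, the wiring looks like this (a simplified sketch using the Dataflow 1.x SDK; the topic, table, schema and conversion DoFn names are placeholders for my real ones):

        Pipeline p = Pipeline.create(options);
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/my-topic"))
         .apply(ParDo.of(new MessageToRowFn()))     // turns each message into a TableRow
         .apply(BigQueryIO.Write
             .to("my-project:my_dataset.my_table")
             .withSchema(schema));
        p.run();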

On occasion, data is passed which fails to insert. That is alright; I know the reason for the failure. But Dataflow continuously attempts to insert this data over and over. I would like to limit the number of retries, as the repeated failures bloat the worker logs with irrelevant information and make it extremely difficult to troubleshoot the actual problem when the same error appears again and again.

When running the pipeline locally I get:

no evaluator registered for Read(PubsubSource)

I would love to be able to test the pipeline locally, but it does not seem that Dataflow supports this option with Pub/Sub.

To clear the errors, I am left with no choice but to cancel the pipeline and run a new job on Google Cloud, which costs time and money. Is there a way to limit the errors? Is there a way to test my pipeline locally? Is there a better approach to debugging the pipeline?

Dataflow UI

Job ID: 2017-02-08_09_18_15-3168619427405502955

There is currently no way to limit errors besides exception handling. See the note at the bottom here: cloud.google.com/dataflow/pipelines/… It will indefinitely try to rerun your code. I don't think there's a way to test with Pub/Sub locally; if you look at the examples, they usually read from a CSV file for local runs. - Idrees Khan
Hypothetically, is it possible to have Dataflow acknowledge a Pub/Sub message when an exception arises? From what I can tell, unless the Pub/Sub message is successfully processed, Cloud Dataflow will never acknowledge the message. - Owen Monagan
That is probably entirely up to what kind of exceptions you are expecting and want to handle. As an example, you can take a look at this blog post cloud.google.com/blog/big-data/2016/01/… In my case I am receiving and parsing JSON messages from Pub/Sub, and if that conversion fails I catch and log the payload for later analysis. - Idrees Khan
I'm taking a look at your job. I'll report back in a few hours. - Pablo
Are you trying to run the job locally using the DirectPipelineRunner? Try running your pipeline locally using InProcessPipelineRunner, which should support streaming. - Pablo

1 Answer

3
votes

On @Pablo's suggestion, use the InProcessPipelineRunner to run the pipeline locally with unbounded data sets.

        // InProcessPipelineRunner comes from com.google.cloud.dataflow.sdk.runners.inprocess (Dataflow 1.x SDK)
        dataflowOptions.setRunner(InProcessPipelineRunner.class);

Running the program locally has allowed me to handle errors with exceptions and optimize my workflow rapidly.
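
For the bad records themselves, the catch-and-log approach Idrees Khan described is what made that possible: the conversion DoFn catches the parse failure and logs the payload instead of throwing, so the element is treated as processed (and the Pub/Sub message acknowledged) rather than retried forever. A rough sketch, where MessageToRowFn and parseJsonToRow stand in for my actual conversion code:

        public class MessageToRowFn extends DoFn<String, TableRow> {
            private static final Logger LOG = LoggerFactory.getLogger(MessageToRowFn.class);

            @Override
            public void processElement(ProcessContext c) {
                String payload = c.element();
                try {
                    c.output(parseJsonToRow(payload));   // my real JSON -> TableRow conversion
                } catch (Exception e) {
                    // Swallow the failure so the bundle succeeds and the message is not retried;
                    // log the payload so the bad record can be analysed later.
                    LOG.error("Failed to convert message, skipping: " + payload, e);
                }
            }
        }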