
Simple Google Cloud Dataflow pipeline:

PubsubIO.readStrings().fromSubscription -> Window -> ParDo -> DatastoreIO.v1().write()
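
In Java SDK 2.x code, the pipeline looks roughly like the following minimal sketch; the subscription name, window size, and the conversion of a message into a Datastore entity are placeholders:

```java
// A minimal sketch of the pipeline above (Java SDK 2.x). The subscription name,
// window size, and the payload-to-entity conversion are placeholders.
import static com.google.datastore.v1.client.DatastoreHelper.makeKey;
import static com.google.datastore.v1.client.DatastoreHelper.makeValue;

import com.google.datastore.v1.Entity;
import java.util.UUID;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubsubToDatastore {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubsub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"))
     .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
     .apply("ToEntity", ParDo.of(new DoFn<String, Entity>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Placeholder conversion: store the raw payload under a random key.
         Entity entity = Entity.newBuilder()
             .setKey(makeKey("PubsubMessage", UUID.randomUUID().toString()))
             .putProperties("payload", makeValue(c.element()).build())
             .build();
         c.output(entity);
       }
     }))
     .apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId("my-project"));

    p.run();
  }
}
```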

When load is applied to the pubsub topic, the messages are read but not acked:

Jul 25, 2017 4:20:38 PM org.apache.beam.sdk.io.gcp.pubsub.PubsubUnboundedSource$PubsubReader stats INFO: Pubsub projects/my-project/subscriptions/my-subscription has 1000 received messages, 950 current unread messages, 843346 current unread bytes, 970 current in-flight msgs, 28367ms oldest in-flight, 1 current in-flight checkpoints, 2 max in-flight checkpoints, 770B/s recent read, 1000 recent received, 0 recent extended, 0 recent late extended, 50 recent ACKed, 990 recent NACKed, 0 recent expired, 898ms recent message timestamp skew, 9224873061464212ms recent watermark skew, 0 recent late messages, 2017-07-25T23:16:49.437Z last reported watermark

What pipeline step should ack the messages?

  • The Stackdriver dashboard shows some acks, but the number of unacked messages stays stable.
  • There are no error messages in the trace indicating that message processing failed.
  • Entries do show up in Datastore.
What runner are you using? Do you have a job ID from the pipeline you ran? - Ben Chambers
I have been using the Dataflow managed service runner, which gives me a job ID, as well as my local runner for development. I am using the Java SDK 2, by the way. - jean
I have occasionally seen RPC timeout (DEADLINE_EXCEEDED) messages and tried with a bigger machine and more workers, as advised here: cloud.google.com/dataflow/pipelines/… The job number: 2017-07-27_11_40_13-16176649068898383100 @BenChambers - jean
There is also the job ID 2017-07-27_14_51_37-4624978117098944513, which has the issue reproduced and debug logging enabled (--defaultWorkerLogLevel=DEBUG). - jean

1 Answer


Dataflow will only acknowledge PubSub messages after they have been durably committed somewhere else. In a pipeline of the form PubSub -> ParDo -> one or more sinks, acknowledgment can be delayed by any of the sinks having problems (even if the writes are being retried, that slows things down). This is part of ensuring that results appear to be processed effectively once. See a previous question about when Dataflow acknowledges a message for more details.

One (easy) option to change this behavior is to add a GroupByKey (using a randomly generated key) after the PubSub source and before the sinks. This causes the messages to be acknowledged earlier, but may perform worse, since PubSub is generally better at holding unprocessed input than a GroupByKey is.
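
A rough sketch of that approach, assuming string messages as in the question; here `messages` stands for the output of the PubsubIO read, and the transform names and the shard count of 100 are arbitrary. Note that the windowing has to come before the GroupByKey, since grouping an unbounded PCollection requires a non-global window or a trigger:

```java
// Hedged sketch: key each message randomly, group, then flatten back to single
// messages. The GroupByKey forces the elements to be durably committed, which
// lets the PubSub source acknowledge them without waiting for the Datastore write.
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// ...inside the pipeline construction, right after the PubSub read:
PCollection<String> checkpointed = messages
    .apply("Window", Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply("AddRandomKey", WithKeys.of(
            (String msg) -> ThreadLocalRandom.current().nextInt(100))
        .withKeyType(TypeDescriptors.integers()))
    .apply("Checkpoint", GroupByKey.<Integer, String>create())
    .apply("DropKeys", Values.<Iterable<String>>create())
    .apply("FlattenGroups", Flatten.<String>iterables());

// The existing ParDo and DatastoreIO.v1().write() then consume `checkpointed`
// instead of the raw PubSub output.
```

The exact number of random keys mostly affects how well the grouping step parallelizes; it does not change the acknowledgment behavior itself.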