0
votes

I am working on a dataflow pipeline written in python2.7 using apache_beam==2.24.0 . The work of the pipeline is to consume pubsub messages from a subscription using beam's ReadFromPubSub in batches, do some processing on the messages and then to persist the resultant data to two different bigquery tables. There is a lot of data that I am consuming. Google-cloud-pubsub version is 1.7.0 . After running the pipeline everything works fine but after a few hours I start getting the exception:

org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled

On gcp dataflow console, the logs show this error but the job in itself seems to work fine. It consumes data from the subscription and writes it to bigquery. What CANCELLED: call is being referred to here and why am I getting this error? How can I resolve this?

Full stacktrace:

Caused by: org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Status.asRuntimeException(Status.java:524)
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341)
    org.apache.beam.sdk.fn.stream.DirectStreamObserver.onNext(DirectStreamObserver.java:98)
    org.apache.beam.sdk.fn.data.BeamFnDataSizeBasedBufferingOutboundObserver.flush(BeamFnDataSizeBasedBufferingOutboundObserver.java:100)
    org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.shouldWait(RemoteGrpcPortWriteOperation.java:124)
    org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.maybeWait(RemoteGrpcPortWriteOperation.java:167)
    org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.process(RemoteGrpcPortWriteOperation.java:196)
    org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
    org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
    org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:57)
    org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:39)
    org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
    org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
    org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:123)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1365)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:154)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1085)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)
1
A streaming pipeline retries failed elements indefinitely. As long as the system latency and data freshness is normal, you don't need to worry about the low level errors. This seems to be some common grpc error: stackoverflow.com/questions/57110811/…. You mentioned using Python SDK, the stacktrace is in Java. Did you use some xlang feature?大ドア東
I am just using apache-beam's python sdk. The sdk might be using some xlang feature internally.Aman Goel
The errors shouldn't cause much trouble. Also, could you please try using Python3 and newer versions of Beam? There could be some grpc issues that are fixed now.大ドア東
The project is using python's version 2.7 only and beam==2,24 is the last supported version for python2.7 . Though the pipeline is using "Dataflow" runner but is it possible that the bash process (which is used to run the python pipeline) going to sleep might be causing the issue? Yesterday I was monitoring the pipeline for 10 straight hours and didn't get the error but generally error comes within 3 hours of starting the pipeline.Aman Goel
A bash script shouldn't cause this since it's running on Dataflow. A similar issue was reported here but marked as not a bug: issues.apache.org/jira/browse/BEAM-9630. This doesn't seem to be an issue and you can probably ignore it. I also added a comment asking about it in that ticket.大ドア東

1 Answers

0
votes

The client I am working for has option for raising request ticket for Google Cloud Support. The exact reply from Google Cloud Support:

This error you are finding is rather harmless. The dataflow is a massively parallel data processing platform and when there are autoscaling events which can move the worker VM around. When the VM is getting shut down the grpc channel is closed before the runner process and the work item being processed will be retried on another newly launched runner. These errors can be ignored.