Running a streaming Dataflow pipeline with a fairly advanced group-by using session windows, I run into problems after a couple of hours. The job scales up in workers, but later starts producing loads of logs like the following:
Processing lull for PT7500.005S in state process of ...
The transformation that logs this message sits right after the "group by" block and executes an async HTTP call (using scala.concurrent.{Await, Promise}) to an external service.
Any ideas why this happens? Is it related to the async calls, the scaling, or the group-by strategy?
- Job ID: 2018-01-29_03_13_40-12789475517328084866
- SDK: Apache Beam SDK for Java 2.2.0
- Scio version: 0.4.7
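For context, this is roughly the shape of the step in question; it is a minimal sketch of the pattern described above, not the actual job code. The `callExternalService` helper, the element types, and the 30-second timeout are placeholders I've assumed for illustration:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical enrichment step applied right after the group-by: it fires an
// async HTTP call and then blocks on the result. Await.result parks the worker
// thread for up to the full timeout, which is what Dataflow reports as a
// "processing lull" when the external service is slow or never responds.
def enrich(sessionKey: String, events: Iterable[String]): String = {
  val response: Future[String] = Future {
    callExternalService(sessionKey, events) // placeholder for the real HTTP client call
  }
  Await.result(response, 30.seconds) // blocks the bundle until the call returns or times out
}

// Stub standing in for the real external service call.
def callExternalService(key: String, events: Iterable[String]): String =
  s"$key:${events.size}"
```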
In the end, I backed the async calls with an ExecutionContextExecutorService built on a configurable thread pool. This allowed me to control the number of concurrent requests to the service. If the service became overloaded, I could turn down the number of threads. Unfortunately, there's no good substitute for async calls to services. The alternative is to include the service logic as a transform (i.e. calling out directly to a database). I also experimented with implementing the Dataflow job in Node.js, which is built for async functionality. – Andrew Nguonly
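For reference, here is a minimal sketch of the approach the comment describes: backing the Futures with a fixed-size thread pool so the number of in-flight requests to the external service is bounded. The pool size of 8 and the `callServiceAsync` helper are illustrative assumptions, not details from the comment:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService, Future}

// A dedicated, bounded execution context for the HTTP calls. Shrinking the
// pool size throttles how many concurrent requests hit the external service.
val httpPool: ExecutionContextExecutorService =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

// Each call runs on the bounded pool instead of the global execution context.
def callServiceAsync(payload: String): Future[String] =
  Future {
    s"response-for-$payload" // placeholder for the real HTTP client call
  }(httpPool)

// Remember to release the threads when the worker tears down:
// httpPool.shutdown()
```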