2 votes

Running a streaming Dataflow pipeline with a fairly advanced group-by using session windows, I run into problems after a couple of hours. The job scales up in workers, but later starts producing loads of logs like the following:

Processing lull for PT7500.005S in state process of ...

The transform that logs this message is right after the group-by block and executes an async HTTP call (using scala.concurrent.{Await, Promise}) to an external service.
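
Not the actual job's code, but a minimal sketch of what such a stage might look like in Scio. The `Event` type, its `userId` field, the `callService` HTTP client, and the 30-minute session gap are all hypothetical placeholders:

```scala
import com.spotify.scio.values.SCollection
import org.joda.time.{Duration => JodaDuration}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

// Hypothetical event type and HTTP client; placeholders for the real job.
case class Event(userId: String, payload: String)
def callService(events: Iterable[Event]): Future[String] = ???

def enrich(events: SCollection[Event]): SCollection[(String, String)] =
  events
    .withSessionWindows(JodaDuration.standardMinutes(30))
    .keyBy(_.userId)
    .groupByKey
    .map { case (userId, session) =>
      // Blocking on the Future inside the transform: if the Future never
      // completes, the worker thread is stuck and Dataflow reports a
      // "Processing lull" for this step.
      val response = Await.result(callService(session), Duration.Inf)
      (userId, response)
    }
```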

Any ideas why this happens? Is it related to the async call, scaling, or the group-by strategy?

  • Job ID: 2018-01-29_03_13_40-12789475517328084866
  • SDK: Apache Beam SDK for Java 2.2.0
  • Scio version: 0.4.7
This might be related to calling out to the HTTP service asynchronously. I've experienced similar issues. As a test, you could try calling the service synchronously. You won't get nearly as high throughput, but it may help you determine whether the issue is related to the async call. – Andrew Nguonly
Could it be that you are overloading the server that you're talking to via HTTP? – Pablo
@Andrew: I will surely try this. The reason I used async in the first place was both to get better throughput and to be able to apply retry logic for HTTP server errors. Do you have any recommendations for a good substitute? – Brodin
@Pablo: Well, the throughput is pretty high, but that shouldn't be a problem, since the service I talk to is auto-scaled to infinity and beyond. However, if I did overload the service, why would Beam act this way? – Brodin
@Brodin, one thing I experimented with was configuring the number of threads used for the ExecutionContextExecutorService. This allowed me to control the number of concurrent requests to the service; if the service became overloaded, I could turn down the number of threads. Unfortunately, there's no good substitute for async calls to services. The alternative is to include the service logic as a transform (i.e. calling out directly to a database). I also experimented with implementing the Dataflow job in Node.js, which is built for async functionality. – Andrew Nguonly
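
A minimal sketch of the thread-pool tuning Andrew Nguonly describes in the last comment, assuming the HTTP client's Futures can be created against an explicit ExecutionContext (the names and the pool size are illustrative):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}

// A fixed-size pool caps the number of in-flight HTTP calls per worker;
// lowering the size throttles the load on the downstream service.
val maxConcurrentRequests = 8 // hypothetical value; tune for the service
val httpEc: ExecutionContextExecutorService =
  ExecutionContext.fromExecutorService(
    Executors.newFixedThreadPool(maxConcurrentRequests))

// Create the HTTP-call Futures against this context instead of the
// default global one, e.g. Future(httpClient.call(req))(httpEc).
```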

1 Answer

1 vote

@jkff's comment pointed me in the right direction. The first step was to add a timeout to the Scala Future, which showed that the "Processing lull" messages were actually caused by promises that never completed, forcing Dataflow to keep them around "forever". Now I get proper Future timeout errors, but to no avail, since the job still makes no progress. I have switched to synchronous calls for now, but I am seeing much lower throughput.
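
For reference, the Future timeout mentioned above might look like this (the helper name and the 30-second default are illustrative, not from the actual job):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Try

// A bounded wait instead of Duration.Inf: a promise that never completes
// now surfaces as a TimeoutException inside the Try instead of stalling
// the worker thread indefinitely.
def awaitWithTimeout[A](f: Future[A], timeout: FiniteDuration = 30.seconds): Try[A] =
  Try(Await.result(f, timeout))
```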