
I have a strange problem and I am unable to identify its cause.

I have this "simple" Akka application. Its main goal is to go over every document in a database. My main actor requests rows from a single actor that communicates with the database, and each retrieved document is returned to my main actor. In batches, these documents are added to a message queue managed by a balancing dispatcher, where small workers go over them and sort them.
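To picture what the balancing dispatcher does in this design: in Akka 2.2, actors that share a BalancingDispatcher all take their messages from one shared mailbox, so whichever worker is idle picks up the next queued document. The sketch below is a plain-JVM analogy, not Akka code: a single LinkedBlockingQueue stands in for the shared mailbox and a fixed thread pool stands in for the worker actors. SharedQueueSketch, process and batchSize are hypothetical names.

```scala
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Plain-thread analogy of a BalancingDispatcher: many workers, one shared queue.
// Hypothetical names; this is not how Akka is implemented internally.
object SharedQueueSketch {
  def process(documents: Seq[String], nWorkers: Int, batchSize: Int = 100): Int = {
    val sharedQueue = new LinkedBlockingQueue[String]()
    val processed = new AtomicInteger(0)

    // "Main actor": enqueue the documents batch by batch
    documents.grouped(batchSize).foreach(batch => batch.foreach(doc => sharedQueue.put(doc)))

    // "Workers": each one keeps taking from the same queue until it is empty
    val pool = Executors.newFixedThreadPool(nWorkers)
    for (_ <- 1 to nWorkers) {
      pool.execute(new Runnable {
        def run(): Unit = {
          var doc = sharedQueue.poll()
          while (doc != null) {
            processed.incrementAndGet() // stand-in for sorting the document
            doc = sharedQueue.poll()
          }
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(30, TimeUnit.SECONDS)
    processed.get()
  }
}
```

Because every document is queued before the workers start, each worker simply drains the queue and exits; in the real application the main actor keeps refilling the queue while the workers run.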

After a few hours, usually between 2 and 4, all actors stop at the same time, within about 5 seconds of each other.

I was wondering whether any of you have ever seen something similar.

For information:

  • I use Akka 2.2.0
  • Only tell is used, never ask
  • I do not use any thread-blocking calls such as Await
  • I only know that everything shuts down because DeadLetters start appearing

Thank you for your help


Judging from the DeadLetters, only the actors behind my balancing dispatcher / round-robin router stop. Is there something I missed?
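One way to confirm which actors stop is to count dead letters per recipient. In Akka, a listener actor subscribed with system.eventStream.subscribe(listener, classOf[DeadLetter]) receives DeadLetter(message, sender, recipient) events, and recipient.path.toString yields a path string. The helper below is a hypothetical, stdlib-only sketch that groups such path strings by the top-level actor under /user, so a burst of dead letters addressed to round-robin stands out from those addressed to the database actor.

```scala
// Hypothetical helper: tally dead-letter recipient paths by top-level actor.
object DeadLetterTally {
  // "akka://app/user/round-robin/$a" -> "round-robin"; paths outside /user are kept whole
  def byTopLevelActor(recipientPaths: Seq[String]): Map[String, Int] =
    recipientPaths
      .map { path =>
        val parts = path.split("/user/", 2)
        if (parts.length == 2) parts(1).takeWhile(_ != '/') else path
      }
      .groupBy(identity)
      .map { case (actor, hits) => actor -> hits.size }
}
```

Logging this tally every few seconds would show whether only the router's routees receive dead letters, or whether other actors are affected as well.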

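My Scala code:

import akka.actor.{ActorRef, Props}
import akka.routing.FromConfig

// Top-level router at /user/round-robin: pool size and resizer come from
// akka.actor.deployment./round-robin, and the routees run on "balancing-dispatcher"
val workers: ActorRef = context.system.actorOf(
  Props[WorkerActor] // preferred over Props(new WorkerActor), which closes over the enclosing actor
    .withRouter(FromConfig())
    .withDispatcher("balancing-dispatcher"),
  "round-robin"
)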

My configuration:

balancing-dispatcher {
  type = BalancingDispatcher
  executor = "fork-join-executor"
}

akka.actor.deployment {
  /round-robin {
    router = round-robin
    nr-of-instances = 50
    resizer {
      lower-bound = 10
      upper-bound = 100
    }
  }
}
That sounds like a fun one to debug... Any exceptions in your logs? (If you don't have good logging, now would be the time to add it.) – James Adam
This is speculation, but perhaps something happens at that moment, such as garbage collection in the JVM or something in the database, that causes just enough delay to trip other time-out values you may have programmed (e.g. Await.result) and make everything collapse. Try pushing all of your time-out values way up (for experiment's sake) and see whether it survives. – LaloInDublin
Did you profile your solution? I mean, are you sure that the DB, CPU, network and I/O are not overloaded? I solved a similar issue some time ago: after 2 hours of overload in production, our system almost stopped. It turned out that MS SQL started to recalculate a clustered index after that time, because our app kept adding rows, reading them and later removing them. So I would use resource monitoring to identify which part of the application is involved. – Martin Podval
Thanks for your advice. – Philippe
@James, I do not have perfect logging because I do not know what should be logged. I had an output for every message sent, but I replaced it with a DeadLetter listener, which isn't much more helpful. – Philippe
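One thing worth logging is the moment progress stops, together with what every thread is doing at that moment. The sketch below assumes (hypothetically) that each worker calls tick() after handling a document and that something calls check() at a fixed interval; when the counter has not moved since the previous check, it reports a one-line-per-thread summary taken from the JDK's ThreadMXBean. ProgressWatchdog, tick, check and report are illustrative names, not Akka API.

```scala
import java.lang.management.ManagementFactory
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stall detector: workers call tick(), a single scheduler thread calls check().
final class ProgressWatchdog(report: String => Unit) {
  private val processed = new AtomicLong(0L) // safe to increment from many worker threads
  private var lastSeen = 0L                  // only read and written by the thread calling check()

  def tick(): Unit = processed.incrementAndGet()

  // Returns true, and reports a thread summary, if nothing was processed since the last check.
  // Note: the very first check also reports a stall if no document has been processed yet.
  def check(): Boolean = {
    val now = processed.get()
    val stalled = now == lastSeen
    lastSeen = now
    if (stalled) {
      val threads = ManagementFactory.getThreadMXBean()
        .dumpAllThreads(false, false)
        .map(info => s"${info.getThreadName}: ${info.getThreadState}")
        .mkString("\n")
      report(s"No progress since last check ($now processed so far)\n$threads")
    }
    stalled
  }
}
```

check() could be driven by a ScheduledExecutorService via scheduleAtFixedRate, with report writing to the application log, so the log shows the thread states at the moment the actors stop.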

1 Answer


I would first use a profiling tool such as jconsole or jvisualvm to check for memory, GC, and/or fork/join issues. Do you have enough heap allocated? Also record the number of threads and their states (are threads being forked or joined when the slow-down occurs?).
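To record thread counts and states alongside heap and GC figures from inside the application, the standard java.lang.management API is enough. This is a minimal sketch; the assumption that Akka's dispatcher threads carry the dispatcher id (here "balancing-dispatcher") in their names should be checked in jvisualvm before relying on the name filter.

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

object JvmSnapshot {
  final case class HeapGc(usedBytes: Long, maxBytes: Long, gcCount: Long, gcTimeMs: Long)

  // Number of live threads per state, limited to threads whose name contains nameFilter
  def threadStates(nameFilter: String = ""): Map[Thread.State, Int] =
    Thread.getAllStackTraces().keySet().asScala.toSeq
      .filter(_.getName.contains(nameFilter))
      .groupBy(_.getState)
      .map { case (state, threads) => state -> threads.size }

  // Current heap usage plus cumulative collection count and time over all collectors.
  // maxBytes is -1 when undefined; collectors reporting -1 are counted as 0.
  def heapAndGc(): HeapGc = {
    val heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()
    val gcs = ManagementFactory.getGarbageCollectorMXBeans().asScala
    HeapGc(
      heap.getUsed,
      heap.getMax,
      gcs.map(gc => math.max(0L, gc.getCollectionCount)).sum,
      gcs.map(gc => math.max(0L, gc.getCollectionTime)).sum
    )
  }
}
```

Logging threadStates("balancing-dispatcher") and heapAndGc() every minute would show whether the dispatcher's threads disappear, pile up in WAITING, or stay RUNNABLE when the actors stop, and whether GC time jumps just before.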

It could be that you need to configure more threads in Akka's thread pool, or that you've reached your upper limit of 100 instances and they are all busy. You could subclass the DefaultResizer implementation to log resizer activity explicitly, and configure your subclass as the resizer.
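The part of a resizer worth logging is the point where the requested pool size is cut off by upper-bound. The function below is a hypothetical, stdlib-only sketch of that bounds check: it does not reproduce DefaultResizer's pressure calculation, and how a subclass would call it depends on the Resizer methods in your Akka version, which are worth checking in the Akka 2.2 sources first.

```scala
// Hypothetical helper: apply lower/upper bounds to a requested resize and log when they matter.
object ResizeLogging {
  // Returns the number of routees to add (positive) or remove (negative).
  def boundedDelta(current: Int, requestedDelta: Int, lowerBound: Int, upperBound: Int,
                   log: String => Unit): Int = {
    val proposed = current + requestedDelta
    val bounded = math.min(upperBound, math.max(lowerBound, proposed))
    if (proposed > upperBound)
      log(s"resizer wanted $proposed routees but upper-bound is $upperBound; pool is capped")
    if (bounded != current) log(s"resizing pool from $current to $bounded routees")
    bounded - current
  }
}
```

With the configuration above (lower-bound = 10, upper-bound = 100), a log line saying the pool is capped at 100 around the time the actors stop would support the "all instances busy" explanation.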