I have two actors, each in a different ActorSystem. The first actor caches an ActorRef of the second. The first actor does:

actorRef.tell(msg, self())

and sends a message to the second actor, which does some processing and replies with

getSender().tell(reply, self())

Problem: the initial tell() from the first actor to the second sometimes takes 1-3 minutes(!) to deliver the message.

No other messages are sent in Akka apart from this one, so the mailboxes are empty; the system is serving a single request.

System details:

The application has 500 scheduled actors, each of which polls Amazon SQS with a blocking request every 30 seconds (the queue is empty). Another 330 actors do nothing in my scenario. All actors are configured with the default Akka dispatcher.

The box is an Amazon EC2 instance with 2 cores and 8 GB of RAM. CPU and RAM utilization is below 5%. The JVM has around 1000 threads.

My initial guess is CPU starvation and context switching from too many threads, but the problem is not reproducible locally on my 4-core i7 machine, even with 10x the number of actors, which use 75% of available RAM.
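One way to probe a starvation hypothesis outside Akka is a small plain-Java experiment. The sketch below is only an illustration under stated assumptions: a fixed thread pool stands in for the dispatcher (Akka's default dispatcher is a fork-join pool, not this), Thread.sleep stands in for a blocking SQS poll, and the class and method names are hypothetical. It measures how long a trivial task waits when every pool thread is busy with blocking work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class StarvationDemo {

    // Returns how long (in ms) a trivial task sits in the queue before it
    // starts running on a pool whose threads are occupied by blocking tasks.
    static long queueDelayMillis(int poolSize, int blockingTasks, long blockMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            for (int i = 0; i < blockingTasks; i++) {
                pool.execute(() -> {
                    try {
                        Thread.sleep(blockMillis); // stand-in for a blocking poll
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            long submitted = System.nanoTime();
            long[] started = new long[1];
            CountDownLatch done = new CountDownLatch(1);
            // The "message": a task that only records when it finally ran.
            pool.execute(() -> {
                started[0] = System.nanoTime();
                done.countDown();
            });
            done.await();
            return TimeUnit.NANOSECONDS.toMillis(started[0] - submitted);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 2 threads, 4 blocking tasks of 200 ms each queued ahead of the message.
        System.out.println("delay ms: " + queueDelayMillis(2, 4, 200));
    }
}
```

In this toy setup CPU usage stays near zero while the trivial task is still delayed, which is the same combination described above: low utilization together with slow delivery.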

How can I actually find the cause of this problem? Is it possible to profile the Akka infrastructure to see why this message spends so much time in transit from one actor to another?

I would use a profiler like YourKit, or something simpler, to take a thread dump and see how many threads you have and whether all of them are blocked. If no threads are available, your actor won't be able to send the message. Also, I am not sure about your use case, but I would recommend a solution where you don't need to block threads. - hveiga
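The thread-dump suggestion can also be approximated from inside the JVM (the JDK's jstack tool gives a full dump from outside). The following is a minimal sketch using the standard java.lang.management API; the class name, method name, and the name-filter parameter are hypothetical. It counts threads per state, optionally restricted to threads whose name contains a given substring, on the assumption that the dispatcher's worker threads carry "default-dispatcher" in their names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateSummary {

    // Counts live threads per Thread.State. Only threads whose name contains
    // nameFilter are counted; an empty filter matches every thread.
    static Map<Thread.State, Integer> countByState(String nameFilter) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and synchronizer details
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info == null || !info.getThreadName().contains(nameFilter)) {
                continue;
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed naming: dispatcher threads contain "default-dispatcher".
        String filter = args.length > 0 ? args[0] : "";
        countByState(filter).forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

If nearly all dispatcher threads show up as TIMED_WAITING or WAITING inside a blocking call while a message is pending, that points at blocked threads rather than CPU load.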

1 Answer

Context switching from too many threads was a probable source of this problem. To fix it, the following configuration was added:

actor {
  default-dispatcher {
    executor = "fork-join-executor"
    fork-join-executor {
      parallelism-min = 8
      parallelism-factor = 12.0
      parallelism-max = 64
      task-peeking-mode = "FIFO"
    }
  }
}

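Thus, on the 2-core box the dispatcher's thread pool grows from 6 threads to 24 (parallelism-factor 12.0 x 2 cores); note that this is the total pool size, not a per-core figure. 24 threads are enough for our application to run smoothly. No starvation was observed during regression tests.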
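The arithmetic behind the pool size can be sketched as follows. This is an illustration of how the three settings are documented to combine (cores times factor, rounded up, then bounded by min and max), not Akka's own code; the class and method names are hypothetical, and the factor of 3.0 in the second example is just a sample value.

```java
final class ForkJoinSizing {

    // ceil(cores * factor), clamped to the range [min, max].
    static int scaledPoolSize(int min, double factor, int max, int cores) {
        int scaled = (int) Math.ceil(cores * factor);
        return Math.min(Math.max(scaled, min), max);
    }

    public static void main(String[] args) {
        // Settings from the configuration above, on a 2-core machine.
        System.out.println(scaledPoolSize(8, 12.0, 64, 2)); // 24
        // A factor of 3.0 would give 6, which parallelism-min raises to 8.
        System.out.println(scaledPoolSize(8, 3.0, 64, 2));  // 8
        // On 8 cores the same config hits the parallelism-max cap.
        System.out.println(scaledPoolSize(8, 12.0, 64, 8)); // 64
    }
}
```

The same configuration therefore yields different pool sizes on machines with different core counts, which may help explain why behaviour on a 4-core development machine differs from a 2-core instance.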