I'm experiencing issues scaling my app with multiple requests.

Each request sends an ask to an actor, which then spawns other actors. This works fine, but under load (5+ asks at once) the ask takes a very long time to deliver the message to the target actor. The original design was to bulkhead requests evenly, but this is causing a bottleneck. Example:

[Image: trace screenshot, not preserved]

In this picture, the ask is sent right after the query plan resolver. However, there is a multi-second gap before the actor receives this message. This only happens under load (5+ requests/sec). I first thought this was a starvation issue.

Design: Each planner-executor is a separate instance for each request. It spawns a new 'Request Acceptor' actor each time (it logs 'requesting score' when it receives a message).

  • I gave the ActorSystem a custom global executor (a big one). I noticed that the threads were not utilized beyond the core thread pool size, even during this massive delay
  • I made sure all child actors used the correct ExecutionContext
  • I made sure all blocking calls inside actors were wrapped in a Future
  • I gave the parent actor (and all children) a custom dispatcher with core size 50 and max size 100. It did not request more threads (it stayed at 50), even during these delays
  • Finally, I tried creating a totally new ActorSystem for each request (inside the planner-executor). This also had no noticeable effect!
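
One possible reason for a pool staying at its core size, sketched with plain java.util.concurrent rather than Akka: a ThreadPoolExecutor only adds threads beyond the core size once its task queue is full, so with an unbounded queue it never grows. This assumes the dispatcher is backed by a thread pool executor with an unbounded task queue, which may not match the actual configuration. The class name, method name and pool sizes (core 2, max 4) are made up for the illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolGrowthDemo {
    // Submits `tasks` blocked jobs to a pool with core size 2 and max size 4
    // and returns how many threads the pool has created for them.
    static int threadsCreated(BlockingQueue<Runnable> queue, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(2, 4, 60, TimeUnit.SECONDS, queue);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the thread busy, like a stuck task
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Unbounded queue: extra tasks wait in the queue, the pool stays at core size.
        System.out.println("unbounded queue: " + threadsCreated(new LinkedBlockingQueue<>(), 6));
        // Bounded queue of 2: once it is full, the pool grows toward max size.
        System.out.println("bounded queue:   " + threadsCreated(new ArrayBlockingQueue<>(2), 6));
    }
}
```

With six blocked tasks, the unbounded variant reports 2 threads and the bounded variant reports 4, so a pool that never exceeds its core size is not by itself evidence that threads are idle.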

I'm a bit stumped by this. From these tests it does not look like a thread starvation issue. Back at square one, I have no idea why the message takes longer and longer to deliver the more concurrent requests I make. The Zipkin trace before this point does not degrade with more requests until it reaches the ask here. Before then, the server is able to handle multiple steps, e.g. verify the request, talk to the DB, and then finally go inside the planner-executor. So I doubt the application itself is running out of CPU time.

Do you really need to spawn a new planner-executor and "Request Acceptor" actor for each request? You might want to use a router with a dedicated dispatcher. Updating your question with reproducible code could give us a better picture. - Branislav Lazic
In the original design, it helped keep things clean. Is this considered an anti-pattern? If not, are you suggesting that re-using the actors might give performance benefits? - stan

1 Answer

We had a very similar issue with Akka. We observed huge delays in the ask pattern delivering messages to the target actor at peak load.

In our case, most of these issues were related to heap memory consumption rather than to how dispatchers were used.

We eventually fixed them with the following configuration tuning and changes.

1) Make sure you stop entities/actors which are no longer required. If it's a persistent actor, you can always bring it back when you need it. See: https://doc.akka.io/docs/akka/current/cluster-sharding.html#passivation

2) If you are using cluster sharding then check the akka.cluster.sharding.state-store-mode. By changing this to persistence we gained 50% more TPS.

3) Minimize your log entries (set the log level to info).

4) Tune your logging so that messages are published to your logging system frequently. Adjust the batch size, batch count and interval so that memory is freed sooner. In our case a large part of the heap was used for buffering log messages before sending them in bulk. If the interval is too long, the buffer can fill the heap, which hurts performance (more GC activity is required).

5) Run blocking operations on a separate dispatcher.

6) Use custom serializers (protobuf) and avoid JavaSerializer.

7) Add the following JAVA_OPTS when launching your jar

export JAVA_OPTS="$JAVA_OPTS -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2 -Djava.security.egd=file:/dev/./urandom"

The main thing is -XX:MaxRAMFraction=2, which lets the heap use up to half of the available memory. The default is 4, which means your application will use only one fourth of the available memory, which might not be sufficient.
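
As a rough illustration of that arithmetic (a sketch, not the JVM's exact sizing logic, which has further rules for small memory sizes): when no explicit -Xmx is given, the default maximum heap is approximately the available memory divided by MaxRAMFraction. The class name, method name and the 2 GiB container limit below are hypothetical.

```java
final class HeapSizing {
    // Approximate default max heap: available RAM divided by MaxRAMFraction.
    static long defaultMaxHeapBytes(long availableRamBytes, int maxRamFraction) {
        return availableRamBytes / maxRamFraction;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long containerLimit = 2048 * mib; // hypothetical 2 GiB container limit

        System.out.println("fraction 4: " + defaultMaxHeapBytes(containerLimit, 4) / mib + " MiB");
        System.out.println("fraction 2: " + defaultMaxHeapBytes(containerLimit, 2) / mib + " MiB");

        // What the running JVM actually settled on, for comparison.
        System.out.println("actual max heap: " + Runtime.getRuntime().maxMemory() / mib + " MiB");
    }
}
```

Printing Runtime.getRuntime().maxMemory() inside the container is a quick way to check whether the flags took effect.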

See: https://blog.csanchez.org/2017/05/31/running-a-jvm-in-a-container-without-getting-killed/
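
For point 5, the idea can be sketched with plain java.util.concurrent: blocking work is handed to its own pool so it cannot occupy the threads that deliver messages. In Akka the equivalent is a dedicated dispatcher defined in configuration. This is a minimal sketch rather than our actual code, and the names used here (BlockingIsolation, blockingLookup, the "blocking-io" thread name, the pool size of 4) are invented for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BlockingIsolation {
    // Hypothetical stand-in for a blocking call (database query, HTTP client, ...).
    static String blockingLookup(String key) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return key + " handled on " + Thread.currentThread().getName();
    }

    // Runs the blocking call on the dedicated pool, never on the caller's thread.
    static CompletableFuture<String> lookupAsync(String key, ExecutorService blockingPool) {
        return CompletableFuture.supplyAsync(() -> blockingLookup(key), blockingPool);
    }

    public static void main(String[] args) {
        // Separate, bounded pool reserved for blocking work only.
        ExecutorService blockingPool = Executors.newFixedThreadPool(4, r -> {
            Thread t = new Thread(r, "blocking-io");
            t.setDaemon(true);
            return t;
        });
        System.out.println(lookupAsync("score", blockingPool).join());
        blockingPool.shutdown();
    }
}
```

Wrapping a blocking call in a future only helps if that future runs on a pool other than the one the actors use; otherwise the same threads still end up blocked.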

Regards,

Vinoth