I'm looking for guidelines to maximize throughput and minimize latency for gRPC unary calls. I need to achieve about 20,000 QPS, < 50ms each. On moderate hardware (a 4-core CPU) I could only achieve about 15K QPS with an average latency of 200ms. I'm using a Java client and server. The server does nothing except return a response. The client sends multiple concurrent requests using an async stub, and the number of concurrent requests is limited. CPU remains in the ~80% range. In comparison, using Apache Kafka I can achieve much higher throughput (hundreds of thousands of QPS), as well as latency in the 10ms range.

gRPC's benchmarking suite shows 195µs latency (for a single RPC) and 245k QPS on an 8-core GCE VM, using TLS. Many factors can impact results, but the most basic are the type of benchmark, the amount of warmup, and the network. Your numbers are so far from what's expected that I would generally assume there was no warmup period to give the JIT time to optimize the code, but that's shooting in the dark. - Eric Anderson
Thanks @EricAnderson. I'm now running both the client and server on AWS r4.2xlarge instances. I'm getting much better throughput - 50K QPS (after warmup) - with low latencies, as expected. The client is single-threaded. Both the client and server instances are at ~200% CPU (two threads utilized). How can I squeeze out more performance? - Daniel Nitzan

1 Answer


If you are using grpc-java 1.21 or later and grpc-netty-shaded, you should already be using the Netty epoll transport. If you are using grpc-netty, add a runtime dependency on io.netty:netty-transport-native-epoll (the correct version can be found in grpc-netty's pom.xml or in the version table in SECURITY.md).
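
If you want to confirm that the native transport actually loaded, a quick check like the sketch below can help. It assumes the unshaded grpc-netty with io.netty:netty-transport-native-epoll on the runtime classpath; Epoll.isAvailable() and Epoll.unavailabilityCause() are Netty's own APIs (grpc-netty-shaded bundles its own relocated copy, so this check does not apply there).

```java
import io.netty.channel.epoll.Epoll;

public final class EpollCheck {
  public static void main(String[] args) {
    if (Epoll.isAvailable()) {
      // The native transport loaded; gRPC's Netty transport will pick it up automatically.
      System.out.println("Netty epoll transport is available");
    } else {
      // Explains why the native library could not be loaded (wrong OS, missing dependency, ...).
      System.out.println("Falling back to NIO: " + Epoll.unavailabilityCause());
    }
  }
}
```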

The default executor for callbacks is a "cached thread pool." If you do not block (or know the limits of your blocking), specifying a fixed-size thread pool can increase performance. You can try both Executors.newFixedThreadPool and ForkJoinPool; we've seen the "optimal" choice vary depending on the workload. You specify your own executor via ServerBuilder.executor() and ManagedChannelBuilder.executor().
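
As a rough sketch of wiring a fixed-size executor into both sides: the port, address, and pool size below are placeholders to tune for your setup, and the commented-out service registration stands in for your own implementation.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Server;
import io.grpc.ServerBuilder;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class FixedExecutorSetup {
  public static void main(String[] args) throws Exception {
    // A fixed-size pool sized to the machine; tune the size for your workload.
    ExecutorService appExecutor =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Server side: replace the default cached thread pool with the fixed pool.
    Server server = ServerBuilder.forPort(50051)
        .executor(appExecutor)
        // .addService(new MyServiceImpl())  // hypothetical service implementation
        .build()
        .start();

    // Client side: the same idea applies to the channel's callback executor.
    ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        .executor(appExecutor)
        .build();

    server.awaitTermination();
  }
}
```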

If you have high throughput (~Gbps+ per client with TLS; higher if plaintext), using multiple Channels can improve performance by spreading load over multiple TCP connections. Each TCP connection is pinned to a Thread, so more TCP connections allow more Threads to be used. You can create the multiple Channels and then round-robin over them, selecting a different one for each RPC. Note that you can easily implement the Channel interface to hide this complexity from the rest of your application. This looks like it would give you a large gain in particular, but I put it last because it's commonly not necessary.
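
A minimal sketch of such a wrapper, assuming only the plain io.grpc API; how many underlying channels you build and how you configure them are up to you.

```java
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ManagedChannel;
import io.grpc.MethodDescriptor;

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** A Channel that round-robins each new RPC over several underlying ManagedChannels. */
final class RoundRobinChannel extends Channel {
  private final List<ManagedChannel> delegates;
  private final AtomicInteger next = new AtomicInteger();

  RoundRobinChannel(List<ManagedChannel> delegates) {
    this.delegates = delegates;
  }

  private Channel pick() {
    // Math.floorMod keeps the index non-negative even after integer overflow.
    return delegates.get(Math.floorMod(next.getAndIncrement(), delegates.size()));
  }

  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> newCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions) {
    // Each RPC is started on the next channel (and therefore TCP connection) in turn.
    return pick().newCall(method, callOptions);
  }

  @Override
  public String authority() {
    // All delegates point at the same target, so any authority works.
    return delegates.get(0).authority();
  }
}
```

You would then build N ManagedChannels to the same target, wrap them in this class, and hand the wrapper to your generated async stub in place of a single channel; the rest of the application never sees the extra connections.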