0
votes

Intro

I have a setup with a client and a server communicating over a TCP connection, and I am seeing weird latency behaviour that I can't understand.

Context

The client sends a request message to the server, which responds with a response message to the client. I define latency as the time from sending a request message to receiving the corresponding response message. I can send request messages at different rates (throttling the frequency of requests), but I always have at most one outstanding request message at any time, i.e. no concurrent/overlapping request-response pairs.

I have implemented the sending of request and response messages in three ways: the first is directly on TCP sockets with my own serialization method etc., the second uses gRPC for RPC communication over HTTP/2, and the third uses Apache Thrift (an RPC framework similar to gRPC). The gRPC variant is in turn implemented with 4 different client/server types, and for Thrift I have 3 different client/server types.

In all solutions, I see a decrease in latency when increasing the sending rate of request messages (in gRPC and Thrift a request-response pair is communicated via an RPC method). The best latency is observed when not throttling the request rate at all, but sending a new request as soon as a response is received. Latency is measured using the std::chrono::steady_clock primitive. I have no idea what is causing this. I make sure to warm up the TCP connection (getting past the TCP slow-start phase) by sending 10k request messages before starting the real measurements.
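
The warmup itself is just the same request/response exchange, untimed and unthrottled. Roughly like this (a sketch, not verbatim from my code; the send/receive calls are the transport-specific ones used in the timed loop below):

// Warmup: ~10k untimed request/response round trips to get past TCP slow start.
for (int i = 0; i < 10000; ++i) {
  RequestType request("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
  ResponseType response;
  sendRequest(request);        // same transport-specific call as in the timed loop
  receiveResponse(&response);
}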

This is how I implement the throttling and measure latency (on the client, of course):

#include <chrono>
#include <iostream>
#include <thread>

double rate;
std::cout << "Enter rate (requests/second):" << std::endl;
std::cin >> rate;
auto interval = std::chrono::microseconds(1000000) / rate; // per-request period (floating-point microseconds)

// Warmup phase runs here (sketched above), but is not included in this listing.

auto total_lat = std::chrono::microseconds(0);
auto iter_time = std::chrono::steady_clock::now();
int i = 0;
for (i = 0; i < 10000; i++) { // send 10k requests
  iter_time = std::chrono::steady_clock::now();
  RequestType request("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
  ResponseType response;
  auto start = std::chrono::steady_clock::now();
  sendRequest(request);      // these look different depending on gRPC/Thrift/"TCP"
  receiveResponse(&response);
  auto end = std::chrono::steady_clock::now();
  auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  total_lat += dur;
  std::this_thread::sleep_until(iter_time + interval); // throttle the sending
}
// mean latency: total_lat / i
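
After the loop I just report the arithmetic mean, something like this (sketch, not verbatim from my code):

// Mean latency in microseconds over the i timed requests.
double mean_us = static_cast<double>(total_lat.count()) / i;
std::cout << "mean latency: " << mean_us << " us over " << i << " requests\n";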

I run the client and server in separate Docker containers using docker-compose, and I also run them in a Kubernetes cluster. In both cases I see the same behaviour. I am thinking maybe my throttling/time-measuring code is doing something that I don't know about or understand.

TCP_NODELAY is set on the sockets in all cases. The servers are single- and multi-threaded, blocking and non-blocking, in all kinds of variations, and the clients are some synchronous, some asynchronous, etc. So there is a lot of variation, but the same behaviour across all of them.
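
For the raw-TCP variant the flag is set the usual way (simplified sketch; fd is the already-connected socket, and gRPC/Thrift set it through their own channel/socket options):

#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on the connected socket (done on both client and server).
int flag = 1;
if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) < 0) {
  perror("setsockopt(TCP_NODELAY)");
}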

Any ideas as to what could cause such behaviour?

1
My two cents: From your description it appears you reach a congestion window size where there is no congestion event happening (yet), or recovery via duplicate fast retransmits is quick because you wait for each response, or equilibrium is attained because of window-size limiting factors. This also affects the sender on the remote side in a similar way: slower sends mean slower linear growth of the remote congestion window per response, since you are already past the aggressive slow-start growth. – DNT
@DNT Thanks for the input! Regarding what you write first about the congestion window size: do you mean that my congestion window grows more slowly when I send requests less frequently, and that slower request rates are therefore more impacted by ACKs? – Rasmus Johansson
Rasmus Johanson: It is hard to say without doing actual measurements and checking all the local environmental factors. The congestion window grows linearly after slow start, and this growth stops when a congestion or loss event happens, which is then rectified by algorithms within the stack such as fast retransmit. As the window grows, more data can be sent out until size limits are reached or an adverse event happens, in which case the window is cut back sharply, potentially towards its initial size, which depends on implementation defaults and parameters. What Liam Kelly points out can also be a factor in the case you describe. – DNT
Aside: For some reason SO does not want to add an at sign before your name :) – DNT
@DNT Thank you for your time and input :) I have taken some data communication and computer networking courses, but have limited practical experience :) – Rasmus Johansson

1 Answer

1
votes

Right now I think the latency issue is not in the network stack, but in the rate at which you are generating and receiving messages.

Your test code does not appear to have any real-time assurances, which would also need to be set up in the container. This means that your for loop does not run at the same speed every time: the OS scheduler can pause it to run other processes (this is how processes share the CPU). This behavior gets even more complicated with containerization mechanisms.
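
As a quick sanity check, something like the following (rough sketch, the 1 ms interval is arbitrary) shows how late sleep_until actually wakes up on your setup; the same scheduler noise can just as easily land between your start and end timestamps inside the timed loop:

#include <chrono>
#include <iostream>
#include <thread>

int main() {
  using namespace std::chrono;
  const auto interval = microseconds(1000); // hypothetical 1 ms tick
  auto next = steady_clock::now();
  microseconds worst(0);
  for (int i = 0; i < 1000; ++i) {
    next += interval;
    std::this_thread::sleep_until(next);
    // How far past the requested wake-up time did we actually resume?
    auto late = duration_cast<microseconds>(steady_clock::now() - next);
    if (late > worst) worst = late;
  }
  std::cout << "worst wake-up overshoot: " << worst.count() << " us\n";
}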

While there are mechanisms in TCP which can cause latency variations (as mentioned by @DNT), I don't think you would be seeing them, especially if the server and client are local. This is why I would rule out the rate of message generation and reception first, before looking at the TCP stack.