1
votes

I am trying to understand why a TCP client stops sending data and waits for the server to respond. I've read about receive and congestion windows, and I've set initcwnd to 400 on both endpoints. I also set net.ipv4.tcp_window_scaling to 1, and both sockets are opened with the TCP_NODELAY option to disable Nagle's algorithm. The RTT between the endpoints is about 35 ms.

It's clear from the tcpdump trace below that at 14:02:46.310155 the client sends its last packet and then waits for an ACK from the server, which arrives ~31 ms later. Once the ACK arrives, the client continues sending data.

14:02:46.268179 IP client > server: Flags [S], seq 2645621234, win 28400, options [mss 1420,sackOK,TS val 6178563 ecr 0,nop,wscale 9], length 0
14:02:46.305282 IP server > client: Flags [S.], seq 339254367, ack 2645621235, win 28160, options [mss 1420,sackOK,TS val 4865788 ecr 6178563,nop,wscale 9], length 0
14:02:46.305343 IP client > server: Flags [.], ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 0
14:02:46.305592 IP client > server: Flags [P.], seq 1:44, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 43
14:02:46.305954 IP client > server: Flags [.], seq 44:1452, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.306023 IP client > server: Flags [.], seq 1452:2860, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.306258 IP client > server: Flags [.], seq 2860:4268, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.306445 IP client > server: Flags [.], seq 4268:5676, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.306586 IP client > server: Flags [.], seq 5676:7084, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.306914 IP client > server: Flags [.], seq 7084:8492, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.307082 IP client > server: Flags [.], seq 8492:9900, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.307251 IP client > server: Flags [.], seq 9900:11308, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.307411 IP client > server: Flags [.], seq 11308:12716, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.307620 IP client > server: Flags [.], seq 12716:14124, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.307760 IP client > server: Flags [.], seq 14124:15532, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.307931 IP client > server: Flags [.], seq 15532:16940, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.308059 IP client > server: Flags [.], seq 16940:18348, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.308216 IP client > server: Flags [.], seq 18348:19756, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.308373 IP client > server: Flags [.], seq 19756:21164, ack 1, win 56, options [nop,nop,TS val 6178573 ecr 4865788], length 1408
14:02:46.309622 IP client > server: Flags [.], seq 21164:22572, ack 1, win 56, options [nop,nop,TS val 6178574 ecr 4865788], length 1408
14:02:46.309852 IP client > server: Flags [.], seq 22572:23980, ack 1, win 56, options [nop,nop,TS val 6178574 ecr 4865788], length 1408
14:02:46.310023 IP client > server: Flags [.], seq 23980:25388, ack 1, win 56, options [nop,nop,TS val 6178574 ecr 4865788], length 1408
14:02:46.310155 IP client > server: Flags [.], seq 25388:26796, ack 1, win 56, options [nop,nop,TS val 6178574 ecr 4865788], length 1408
14:02:46.341579 IP server > client: Flags [.], ack 44, win 55, options [nop,nop,TS val 4865797 ecr 6178573], length 0
14:02:46.341612 IP client > server: Flags [.], seq 26796:28204, ack 1, win 56, options [nop,nop,TS val 6178582 ecr 4865797], length 1408
3
Could it be that it is not taking the receive window scale into account for some reason? It stops right before passing 28160 bytes. - rodolk
Do you have NAT in the middle? Where did you take the tcpdump, on the server or on the client? - rodolk
I took it on the client side. It's on Google Cloud, in different zones. I am not sure how to check whether it has NAT. You wrote "From the tcpdump it's clear Nagle is off and the congestion window has been modified from default. I would say it's a different problem." How do you read this from the tcpdump? I am not experienced with it; can you give me some tips on how to read it? - Roman
Our DevOps engineer told me we do not have NAT between instances; they are in the same "/16" subnetwork. I hope that answers your question. - Roman

3 Answers

2
votes

In the three-way handshake a window scale factor of 2^9 is negotiated for the receiver, and the receiver then advertises a window of 55, i.e. 55*2^9 = 28160 bytes.
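As a quick check of that arithmetic, here is a minimal Python sketch. The helper name `scaled_window` is made up for illustration; the inputs are the raw `win` value and the `wscale` shift count exactly as tcpdump prints them. Note that the window field in SYN segments is never scaled, which is why the SYN-ACK in the trace shows `win 28160` literally while the later ACK shows `win 55`.

```python
def scaled_window(raw_win: int, wscale: int) -> int:
    """Receive window in bytes for a non-SYN segment (raw field << shift count)."""
    return raw_win << wscale

# Server side of the trace: wscale 9 in the SYN-ACK, win 55 in the later ACK.
server_window = scaled_window(55, 9)
print(server_window)  # 28160
```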

The sender then sends a 43-byte packet, immediately followed by 19 packets of 1408 bytes each, for a total of 26795 bytes.

Clearly, the default initial congestion window has been modified; otherwise the 20 packets could not have been sent without receiving an ACK.

However, the 26795 bytes have nearly filled the receiver's advertised window: only 28160-26795 = 1365 bytes remain, which is not enough room for another full 1408-byte segment.

When the ACK from the receiver finally arrives, acknowledging receipt of the 43-byte packet and advertising a window of 55, we know the 43 bytes from the initial packet have been consumed, and we can now calculate that there is exactly enough room available to send one more 1408-byte packet (28160-26795+43=1408).
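The same bookkeeping can be replayed in a few lines of Python. This is only a sketch of the accounting described above, not of what the kernel actually runs; `usable_window` is a hypothetical helper, and the sequence numbers are the relative ones shown by tcpdump.

```python
SEGMENT = 1408     # payload of each full segment in the trace
WINDOW = 55 << 9   # 28160 bytes advertised by the server

def usable_window(window: int, next_seq: int, acked_seq: int) -> int:
    """Bytes the sender may still transmit: window minus unacknowledged bytes."""
    in_flight = next_seq - acked_seq
    return window - in_flight

# After the 20th packet the next sequence number is 26796 and nothing is ACKed yet.
before_ack = usable_window(WINDOW, next_seq=26796, acked_seq=1)
# The server's ACK then moves the acknowledged sequence number to 44.
after_ack = usable_window(WINDOW, next_seq=26796, acked_seq=44)

print(before_ack, before_ack >= SEGMENT)  # 1365 False
print(after_ack, after_ack >= SEGMENT)    # 1408 True
```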

So the problem is that your receiver is not advertising a window big enough to hold the 400*1408 bytes in your initial congestion window. You must similarly adjust the receiver's receive window.
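To see why raising initcwnd alone does not help, recall that a sender is bounded by the smaller of its congestion window and the receiver's advertised window. The sketch below uses a hypothetical `send_limit` helper with the numbers from this trace, and treats the 1408-byte payload as the segment size.

```python
def send_limit(cwnd_segments: int, segment_size: int, rwnd_bytes: int) -> int:
    """Unacknowledged bytes allowed in flight: min(congestion window, receive window)."""
    return min(cwnd_segments * segment_size, rwnd_bytes)

cwnd_bytes = 400 * 1408                 # initcwnd 400 expressed in bytes
limit = send_limit(400, 1408, 55 << 9)  # receiver advertises 55 << 9 bytes

print(cwnd_bytes, limit)  # 563200 28160
```

With these values the receive window, not the congestion window, is the binding limit.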

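Note that if you had been looking at the capture in Wireshark, its TCP analysis might have highlighted this for you. Strictly speaking it is a "window full" condition on the sender's side rather than a "zero window", since the receiver is still advertising 55.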

(Actually, it is a little more complicated than this. I can't fully explain why it doesn't send a partial MTU to fully fill the advertised receive window. If Nagle's algorithm is enabled, this explains it; you can turn it off by setting socket option TCP_NODELAY. If Nagle is off, it might reflect an implementation detail of slow start for your TCP stack.)

1
votes

There are two main reasons for such behaviour:

  1. Slow start. Happens when the connection has just been established or after congestion has been detected.
  2. Nagle's algorithm. Happens when sending TCP segments shorter than the MSS.

In your scenario, the client sends 20 TCP segments and waits for an acknowledgement. Having received one (for the first segment) it sends more (your trace only shows one segment).

0
votes

From the dump of your tcp communication I see the following:

-We cannot be sure Nagle is off, but since the client is sending full-MSS segments, this doesn't seem to be the problem here

-The connection is using window scaling: you can see wscale 9 in the options of both the SYN and SYN+ACK segments
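For anyone new to reading tcpdump output, the window-scaling check can be done mechanically. This is a minimal sketch with a made-up helper `field`, using the two handshake lines copied from the question; scaling is only in effect when both the SYN and the SYN+ACK carry a wscale option.

```python
import re

SYN = ("14:02:46.268179 IP client > server: Flags [S], seq 2645621234, "
       "win 28400, options [mss 1420,sackOK,TS val 6178563 ecr 0,nop,wscale 9], length 0")
SYN_ACK = ("14:02:46.305282 IP server > client: Flags [S.], seq 339254367, "
           "ack 2645621235, win 28160, options [mss 1420,sackOK,"
           "TS val 4865788 ecr 6178563,nop,wscale 9], length 0")

def field(line, name):
    """Return the integer that follows `name` in a tcpdump line, or None if absent."""
    match = re.search(rf"\b{name} (\d+)", line)
    return int(match.group(1)) if match else None

client_wscale = field(SYN, "wscale")
server_wscale = field(SYN_ACK, "wscale")
scaling_active = client_wscale is not None and server_wscale is not None

print(client_wscale, server_wscale, scaling_active)  # 9 9 True
```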

However, I see that just before passing 28160 bytes (the window advertised by the server), the client stops and waits for the server's ACK. That could be because it is not taking the window scale into account, because initcwnd is actually at 20 (you said you set it to 400), or because the application wrote 26796 bytes first and the rest of the bytes later.