2
votes

I'm trying to solve an issue where there is significant latency on outgoing messages that appears to be related to socket flush behaviour. I've been taking packet captures of outgoing FIX messages from a quickfixj initiator to an acceptor.

To summarise the environment: the Java initiator makes a socket connection to a server socket on another server. Both servers run Red Hat Enterprise Linux 5.10. The MSS reported by netstat on the interfaces is 0. The MTU of the NICs is 1500 (effectively unlimited, I believe, for the loopback interface). On the application side the messages are encoded into a byte array by quickfixj and written to the socket. The socket is configured with TCP_NODELAY enabled.
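
For completeness, the initiator-side socket setup is roughly equivalent to the minimal sketch below (the host, port and message bytes are placeholders, not the real configuration):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class InitiatorSocketSketch {
        public static void main(String[] args) throws Exception {
            Socket socket = new Socket();
            socket.setTcpNoDelay(true);                 // disable Nagle's algorithm
            // Placeholder endpoint - the real acceptor host/port differ
            socket.connect(new InetSocketAddress("acceptor.example.com", 6082));
            byte[] fixMessage = new byte[0];            // quickfixj-encoded FIX bytes go here
            socket.getOutputStream().write(fixMessage); // one write per encoded message
            socket.getOutputStream().flush();           // no-op on a plain socket output stream
        }
    }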

I am almost sure I can eliminate the application as the cause of the latency: when the acceptor (the ServerSocket) runs on the same server as the initiator over the loopback interface, there is no sender latency. These are some example packet capture entries taken over the loopback interface:

"No.","Time","Source","Destination","Protocol","Length","SendingTime (52)","MsgSeqNum (34)","Destination Port","Info","RelativeTime","Delta","Push"
"0.001606","10:23:29.223638","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5360","6082","MarketDataSnapshotFullRefresh","0.001606","0.000029","Set"
"0.001800","10:23:29.223832","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5361","6082","MarketDataSnapshotFullRefresh","0.001800","0.000157","Set"
"0.001823","10:23:29.223855","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5362","6082","MarketDataSnapshotFullRefresh","0.001823","0.000023","Set"
"0.002105","10:23:29.224137","127.0.0.1","127.0.0.1","FIX","825","20150527-09:23:29.223","5363","6082","MarketDataSnapshotFullRefresh","0.002105","0.000282","Set"
"0.002256","10:23:29.224288","127.0.0.1","127.0.0.1","FIX","2851","20150527-09:23:29.224,20150527-09:23:29.224,20150527-09:23:29.224","5364,5365,5366","6082","MarketDataSnapshotFullRefresh","0.002256","0.000014","Set"
"0.002327","10:23:29.224359","127.0.0.1","127.0.0.1","FIX","825","20150527-09:23:29.224","5367","6082","MarketDataSnapshotFullRefresh","0.002327","0.000071","Set"
"0.287124","10:23:29.509156","127.0.0.1","127.0.0.1","FIX","1079","20150527-09:23:29.508","5368","6082","MarketDataSnapshotFullRefresh","0.287124","0.284785","Set"

The main points of interest here are: 1) despite the packet length (the biggest here is 2851 bytes), the PUSH flag is set on every packet; and 2) the latency measure I'm using is the difference between the SendingTime (tag 52), set just before the message is encoded, and the packet capture time ("Time"). The capture is taken on the same server as the initiator that is sending the data. Over a capture of 10,000 packets there is no great difference between SendingTime and the capture time when using loopback. For this reason I think I can eliminate the application as the cause of the sending latency.
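
To be explicit about the measurement, the per-message sender latency is simply the capture timestamp minus the FIX SendingTime (tag 52), allowing for the capture clock being one hour ahead of the UTC SendingTime. A rough sketch, with the timestamp formats taken from the CSV above:

    import java.time.Duration;
    import java.time.LocalTime;
    import java.time.format.DateTimeFormatter;

    public class SenderLatencySketch {
        public static void main(String[] args) {
            // "Time" column (local capture clock) and SendingTime (52) from one loopback row
            LocalTime captureTime = LocalTime.parse("10:23:29.223638");
            LocalTime sendingTime = LocalTime.parse("20150527-09:23:29.223",
                    DateTimeFormatter.ofPattern("uuuuMMdd-HH:mm:ss.SSS"));
            // Capture clock appears to be one hour ahead of the UTC SendingTime
            Duration latency = Duration.between(sendingTime.plusHours(1), captureTime);
            System.out.println("sender latency ~ " + latency.toMillis() + " ms"); // ~0 ms on loopback
        }
    }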

When the acceptor is moved to another server on the LAN, the sending latency starts to get worse for packets that are larger than the MTU. This is a snippet of such a capture:

"No.","Time","Source","Destination","Protocol","Length","SendingTime (52)","MsgSeqNum (34)","Destination Port","Info","RelativeTime","Delta","Push"
"68.603270","10:35:18.820635","10.XX.33.115","10.XX.33.112","FIX","1223","20150527-09:35:18.820","842","6082","MarketDataSnapshotFullRefresh","68.603270","0.000183","Set"
"68.603510","10:35:18.820875","10.XX.33.115","10.XX.33.112","FIX","1223","20150527-09:35:18.820","843","6082","MarketDataSnapshotFullRefresh","68.603510","0.000240","Set"
"68.638293","10:35:18.855658","10.XX.33.115","10.XX.33.112","FIX","1514","20150527-09:35:18.821","844","6082","MarketDataSnapshotFullRefresh","68.638293","0.000340","Not set"
"68.638344","10:35:18.855709","10.XX.33.115","10.XX.33.112","FIX","1514","20150527-09:35:18.821","845","6082","MarketDataSnapshotFullRefresh","68.638344","0.000051","Not set"

What's significant here is that when the packets are smaller than the MSS (derived from the MTU), the PUSH flag is set and there is no sender latency. This is expected, as disabling Nagle's algorithm causes PUSH to be set on these smaller packets. When the packet size is bigger than the MSS - a packet size of 1514 in this case - the difference between the time the packet is captured and the SendingTime jumps to about 35ms (e.g. 10:35:18.855658 minus 09:35:18.821 is roughly 34.7ms, after allowing for the one-hour offset between the capture clock and the UTC SendingTime).

It doesn't seem likely that this 35ms latency is caused by the application encoding the messages, as equally large messages were sent in under 1ms over the loopback interface. The capture also takes place on the sender side, so MTU segmentation doesn't seem to be the cause either. The most likely explanation seems to be that because no PUSH flag is set - the packet being larger than the MSS - the socket and/or the TCP stack at the OS level is deciding not to flush it until 35ms later. The test acceptor on the other server is not a slow consumer and is on the same LAN, so ACKs are timely.

Can anyone give any pointers as to what could cause this socket sending latency for packets larger than the MSS? Against a real counterparty in the US this sender latency reaches as high as 300ms. I thought that if a packet was greater than the MSS it would be sent immediately regardless of previous ACKs (as long as the socket buffer size was not exceeded). Netstat generally shows zero socket queue and window sizes, and the issue occurs on all packets larger than the MSS, even from startup. It looks as though the socket is deciding not to flush immediately for some reason, but I'm unsure what factors could cause that.
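
For what it's worth, the buffer and option values the JVM sees on the socket can be checked directly; this is only a sketch with a hypothetical endpoint, and it shows the configured SO_SNDBUF/SO_RCVBUF sizes rather than the live queue occupancy that netstat reports:

    import java.net.Socket;

    public class SocketOptionCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint - substitute the real acceptor host/port
            try (Socket socket = new Socket("acceptor.example.com", 6082)) {
                System.out.println("SO_SNDBUF   = " + socket.getSendBufferSize());
                System.out.println("SO_RCVBUF   = " + socket.getReceiveBufferSize());
                System.out.println("TCP_NODELAY = " + socket.getTcpNoDelay());
            }
        }
    }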

Edit: As pointed out by EJP, there is no flush in Linux. As I understand it, the socket send puts the data into the Linux kernel's network buffers. For these non-PUSH packets, the kernel appears to be waiting for the ACK of the previous packet before it sends the next one. This isn't what I'd expect; in TCP I'd expect packets to keep being sent until the socket buffers filled up.

1
You can eliminate Java from this as well. Java doesn't decide TCP segment sizes, set PUSH flags, etc., and there is no 'socket flush behaviour' unless you're using a buffered stream of some kind. – user207421
Indeed, I only mentioned it being a Java socket for completeness. – glynbach
And I agree, 'flush' is badly worded; as I understand it, after the socket send the data is in the Linux kernel network buffers. It looks to me like the kernel is waiting for an ACK from the receiver for the previous packet before it delivers the next one. – glynbach

1 Answer

1
vote

This is not a comprehensive answer, as TCP behaviour differs depending on many factors, but in this case it was the reason for the problem we faced.

The congestion window in the TCP congestion control implementation allows an increasing number of packets to be sent without an acknowledgement, as long as no signs of congestion, i.e. retransmissions, are detected. Generally speaking, when retransmissions occur the congestion algorithm resets the congestion window, limiting the number of packets that can be in flight before an acknowledgement must be received. This manifests itself as the sender latency we witnessed, as packets were held in the kernel buffer awaiting acknowledgements for prior packets. There are no TCP_NODELAY, TCP_CORK etc. type options that will override the congestion control behaviour in this regard.
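
As a toy illustration only (not the kernel's implementation, and all numbers are assumed), the effect is that a write larger than the current congestion window cannot go out in one burst; the tail of the burst waits roughly one round trip per window's worth of acknowledgements:

    // Toy model of congestion-window pacing, purely illustrative: a burst of
    // segments from one large write is sent at most cwnd segments at a time,
    // and each remaining chunk waits ~1 RTT for the previous chunk's ACKs.
    public class CwndToySketch {
        public static void main(String[] args) {
            int cwndSegments = 2;      // assumed small window (e.g. after a reset or idle period)
            int burstSegments = 6;     // segments produced by one large FIX message
            double rttMillis = 35.0;   // assumed round trip time to the acceptor

            double lastSegmentDelay = 0;
            int sent = 0;
            while (sent < burstSegments) {
                sent += Math.min(cwndSegments, burstSegments - sent); // send what the window allows
                if (sent < burstSegments) {
                    lastSegmentDelay += rttMillis; // the rest of the burst waits for ACKs
                    cwndSegments *= 2;             // slow-start style growth per RTT
                }
            }
            System.out.printf("last segment delayed by ~%.0f ms%n", lastSegmentDelay);
        }
    }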

In our case this was made worse by a long round-trip time to the other venue. However, as it was a dedicated line with very little packet loss per day, it was not retransmissions that caused the congestion control to kick in. In the end the problem appears to have been solved by disabling the following setting in Linux (net.ipv4.tcp_slow_start_after_idle, e.g. via sysctl). When enabled, this also causes the congestion window to be reset, but through detecting idleness rather than packet loss:

"tcp_slow_start_after_idle - BOOLEAN If set, provide RFC2861 behavior and time out the congestion window after an idle period. An idle period is defined at the current RTO. If unset, the congestion window will not be timed out after an idle period. Default: 1

(Note: if you face these issues, it is also worth investigating congestion control algorithms other than the one your kernel is currently configured to use.)