Trying to solve an issue where there is a significant amount of latency on outgoing messages that seems to be related to the socket flush behaviour. I've been taking packet captures of outgoing FIX messages from a quickfixj initiator to an acceptor.
To summarise the environment, the java intiator makes a socket connection to a server socket on another server. Both servers are running Redhat Enterprise Linux 5.10. The MSS from a netstat on the interfaces is 0. The MTU of the NICs are all 1500 (inifinite I believe for the loopback interface). On the application side the messages are encoded into a byte array by quickfixj and written to the socket. The socket is configured with TCP_NODELAY enabled.
I am almost sure I can eliminate the application as the cause of the latency, as when the acceptor (the ServerSocket) is run on the same server as the Initiator using the loopback interface, there is no sender latency. This is an example of some packet capture entries using the loopback interface:
"No.","Time","Source","Destination","Protocol","Length","SendingTime (52)","MsgSeqNum (34)","Destination Port","Info","RelativeTime","Delta","Push"
"0.001606","10:23:29.223638","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5360","6082","MarketDataSnapshotFullRefresh","0.001606","0.000029","Set"
"0.001800","10:23:29.223832","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5361","6082","MarketDataSnapshotFullRefresh","0.001800","0.000157","Set"
"0.001823","10:23:29.223855","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5362","6082","MarketDataSnapshotFullRefresh","0.001823","0.000023","Set"
"0.002105","10:23:29.224137","127.0.0.1","127.0.0.1","FIX","825","20150527-09:23:29.223","5363","6082","MarketDataSnapshotFullRefresh","0.002105","0.000282","Set"
"0.002256","10:23:29.224288","127.0.0.1","127.0.0.1","FIX","2851","20150527-09:23:29.224,20150527-09:23:29.224,20150527-09:23:29.224","5364,5365,5366","6082","MarketDataSnapshotFullRefresh","0.002256","0.000014","Set"
"0.002327","10:23:29.224359","127.0.0.1","127.0.0.1","FIX","825","20150527-09:23:29.224","5367","6082","MarketDataSnapshotFullRefresh","0.002327","0.000071","Set"
"0.287124","10:23:29.509156","127.0.0.1","127.0.0.1","FIX","1079","20150527-09:23:29.508","5368","6082","MarketDataSnapshotFullRefresh","0.287124","0.284785","Set"
The main things of interest there being that 1/ despite the packet length (the biggest here is 2851) the PUSH flag is set on each packet. And 2/ the measure of latency I'm measuring here is the "Sending Time" set by the message before its encoded, and the packet capture time "Time". The packet capture is being done on the same server as the Initiator that is sending the data. For a packet capture of 10,000 packets there is no great difference between "SendingTime" and "Time" when using loopback. For this reason I think I can eliminate the application as the cause of the sending latency.
When the acceptor is moved to another server on the LAN, the sending latency starts to get worse on packets that are greater than the MTU size. This is a snippet of the a capture:
"No.","Time","Source","Destination","Protocol","Length","SendingTime (52)","MsgSeqNum (34)","Destination Port","Info","RelativeTime","Delta","Push"
"68.603270","10:35:18.820635","10.XX.33.115","10.XX.33.112","FIX","1223","20150527-09:35:18.820","842","6082","MarketDataSnapshotFullRefresh","68.603270","0.000183","Set"
"68.603510","10:35:18.820875","10.XX.33.115","10.XX.33.112","FIX","1223","20150527-09:35:18.820","843","6082","MarketDataSnapshotFullRefresh","68.603510","0.000240","Set"
"68.638293","10:35:18.855658","10.XX.33.115","10.XX.33.112","FIX","1514","20150527-09:35:18.821","844","6082","MarketDataSnapshotFullRefresh","68.638293","0.000340","Not set"
"68.638344","10:35:18.855709","10.XX.33.115","10.XX.33.112","FIX","1514","20150527-09:35:18.821","845","6082","MarketDataSnapshotFullRefresh","68.638344","0.000051","Not set"
What's significant here is when the packets are smaller than the MSS (derived from the MTU) then the PUSH flag is set and there is no sender latency. This would be expected as disabling Nagle's algorithm will be causing a PUSH to be set on these smaller packets. When the packet size is bigger than the MSS - a packet size of 1514 in this case - the difference between the time the packet is captured and the SendingTime has jumped to 35ms.
It doesn't seem likely that this 35ms latency is caused by the application encoding the messages, as large packet size messages were sent in <1ms on the loopback interface. The capture also takes place on the sender side, so it doesn't seem that the MTU segmentation that can be the cause either. The most likely reason seems to me that because there is no PUSH flag set - as the packet is larger than the MSS - then the socket and/or TCP stack at the OS level is not deciding to flush it until 35ms later. The test acceptor on the other server is not a slow consumer and is on the same LAN, so ACKs are timely.
Can anyone give any pointers as to what could cause this socket sending latency for > MSS packets? Against a real counterparty in the US this sender latency reaches as high as 300ms. I thought if a packet size was greater than the MSS then it would be sent immediately regardless of previous ACKS (as long as the socket buffer size was not exceeded). Netstat generally shows 0 socket q and wind sizes and the issue seems to occur on all > MSS packets, even from startup. This looks like the socket is deciding not to flush immediately for some reason, but unsure what factors could cause that.
Edit: As pointed out by EJP, there is no flush in linux. The socket send puts the data in the linux kernal's network buffers as I understand it. And it seems for these non-push packets, the kernel is waiting for the ack from the previous packet before it delivers it. This isn't what I'd expect, in TCP I'd expect the packet to still be delivered until the socket buffers filled up.