So, I have application A on one server which sends 710 HTTP POST messages per second to application B on another server, which is listening on a single port. The connections are not keep-alive; each one is closed after use.
After a few minutes, application A reports that it can't open new connections to application B.
I am running netstat continuously on both machines, and see a huge number of connections in TIME_WAIT on each. Virtually all of the connections shown are in that state. From reading online, it seems that a connection stays in this state for some time after each side closes it; on our machines I believe that is 30 seconds, going by the value in /proc/sys/net/ipv4/tcp_fin_timeout.
I have a script running on each machine that's continuously doing:
netstat -na | grep 5774 | wc -l
and:
netstat -na | grep 5774 | grep "TIME_WAIT" | wc -l
The value of each, on each machine, seems to get to around 28,000 before application A reports that it can't open new connections to application B.
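For a breakdown by state rather than two separate totals, the counts could be gathered along these lines. This is only a sketch: count_states is a name made up for illustration, and it assumes the Linux netstat column layout (local address in field 4, remote address in field 5, state in field 6). It matches the port at the end of the address, so a port such as 57741 is not counted by accident.

```shell
# Hypothetical helper: tally TCP sockets per state for one port.
# Reads `netstat -na` style lines on stdin; $1 is the port number.
count_states() {
    awk -v port="$1" '
        # Keep TCP rows whose local or remote address ends in ":<port>"
        $1 ~ /^tcp/ && ($4 ~ (":" port "$") || $5 ~ (":" port "$")) { n[$6]++ }
        END { for (s in n) print s, n[s] }
    '
}

# Only run against the live system if netstat is installed.
if command -v netstat >/dev/null 2>&1; then
    netstat -na | count_states 5774 | sort
fi
```

Running it in a loop on both machines would show whether anything other than TIME_WAIT is growing.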
I've read that the file /proc/sys/net/ipv4/ip_local_port_range provides the total number of connections that can be open at once:
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
61000 - 32768 = 28232, which is right in line with the approximately 28,000 TIME_WAITs I am seeing.
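The subtraction can be checked with shell arithmetic. A minimal sketch; port_span is a hypothetical helper name, and the two bounds are the ones from our machines.

```shell
# Hypothetical helper: width of a port range given its low and high bounds.
port_span() {
    echo $(( $2 - $1 ))
}

# Bounds taken from our machines; on a live box they could be read with:
#   read low high < /proc/sys/net/ipv4/ip_local_port_range
port_span 32768 61000   # prints 28232
```

If both bounds are usable, the count would be one higher (28233), which makes no practical difference here.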
My question is: how is it possible to have so many connections in TIME_WAIT?
It seems that at 710 connections per second being closed, I should see approximately 710 * 30 seconds = 21300 of these at a given time. I suppose that just because there are 710 being opened per second doesn't mean that there are 710 being closed per second...
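Spelled out as a quick calculation (a sketch only; expected_time_wait is a made-up helper, and the 30 second hold time is my assumption from tcp_fin_timeout):

```shell
# Hypothetical helper: steady-state socket count =
#   connections closed per second * seconds each one lingers.
expected_time_wait() {
    echo $(( $1 * $2 ))
}

expected_time_wait 710 30   # prints 21300

# Going the other way: how long a connection would have to linger for
# 28232 of them to pile up at 710 per second (integer division).
echo $(( 28232 / 710 ))     # prints 39
```

So the roughly 28,000 I observe would correspond to each connection lingering for at least about 39 seconds rather than 30, assuming the rate really is a steady 710 per second.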
The only other thing I can think of is the OS being slow to get around to closing the connections.