We are facing a strange problem: closing a dead TCP socket (dead because the network cable was unplugged) affects another, healthy TCP connection. The details are below.
Topology
Client A ←→ Switch A ← Router A:NAT ← .. Network .. → Router B:NAT → Switch B ←→ Server B
Problem:
Suppose there is a dead connection between the client and the server, caused by unplugging the cable between the client machine and its switch. After unplugging the cable, we log in as client A from another machine, which creates a new TCP connection between the client and the server. This new connection works fine.
We find that if the server closes the dead TCP connection while the kernel's TCP stack is still retransmitting data on it, the other TCP connection appears to be polluted: the client-to-server direction stops working, meaning data sent by the client over that connection is never received by the server. What surprised us is that the other direction, from the server to the client, remains OK: data sent by the server over the same TCP socket still reaches the client machine.
But if we wait until retransmission on the dead connection stops (e.g. 2 hours) and then close the socket, the other TCP connection remains OK.
Here are the detailed steps to reproduce the issue:
1. There are two clients, both behind Router A:NAT. The NAT is full-cone.
2. There is a Linux server behind Router B:NAT. The NAT is full-cone, but here it uses port forwarding.
3. There are four machines in total. Call the two clients X and Y, and the server S.
4. X and Y log in and set up a video meeting. Each creates a TCP connection to the server; call them channel CX and channel CY.
5. Unplug the cable of the machine on which client Y is running. Channel CY is now broken and dead, but channel CX remains OK.
6. Log in as Y from the fourth machine and set up a video meeting with X again. This creates a new TCP channel; call it CY2.
Result:
In step 6, if the server closes the dead connection CY within minutes, the new channel CY2 becomes unidirectional: data sent from client Y, including ACK packets, cannot reach the server, while the reverse direction works.
If the server closes the dead connection CY after a long time, such as 2 hours, no problem occurs.
This problem only happens when running through NATs; at least, we have never reproduced it when running the applications within the same LAN (where no NAT traversal is needed).
Does anybody know why this happens?
Edit:
On the server side, we are using non-blocking TCP sockets and the select model.
Pseudocode:
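// server (pseudocode)
listenfd = socket(AF_INET, SOCK_STREAM, 0);   // TCP over IPv4
localAddr.sin_family = AF_INET;
localAddr.sin_port = htons(8013);
localAddr.sin_addr.s_addr = htonl(INADDR_ANY);   // listen on all interfaces
bind(listenfd, &localAddr, ...);
listen(listenfd, 100);
...
// using select model; the first argument must be the highest fd plus one
select(maxFd, &fdSet, NULL, NULL, NULL);
for(...)
{
    if (FD_ISSET(listenfd, &fdSet))
    {
        // new incoming connection
        fd = accept(listenfd, ...);
        set_non_block(fd);
        ...
    }
    ...
}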
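The set_non_block() helper is not shown in the pseudocode. A minimal sketch of what it might look like, assuming it simply sets O_NONBLOCK with fcntl() (the return convention here is an assumption, not taken from our actual code):

```c
#include <fcntl.h>

/* Possible body for the set_non_block() helper named in the pseudocode.
 * Adds O_NONBLOCK to the descriptor's existing file status flags.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
int set_non_block(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}
```

Reading the current flags first and OR-ing in O_NONBLOCK avoids clearing any other status flags already set on the accepted socket.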
More information:
1) Connection A on the first machine: 192.168.10.4:13000 ←→ ... ← Router A:NAT ← (now from PublicIP:8661, random) .. Network .. → Router B:NAT (to port 8013, port forwarding) → ... ←→ Server B
2) Connection B on the second machine: 192.168.10.7:13000 ←→ ... ← Router A:NAT ← (now from PublicIP:8777, random) .. Network .. → Router B:NAT (to port 8013, port forwarding) → ... ←→ Server B
3) Unplug the wire so that connection A is dead, then create a new connection C on the third machine: 192.168.10.10:13000 ←→ ... ← Router A:NAT ← (now from PublicIP:8869, random) .. Network .. → Router B:NAT (to port 8013, port forwarding) → ... ←→ Server B
If we close connection A from the server while it is still retransmitting, connection C becomes unidirectional; but if we close connection A from the server after 2 hours, connection C remains OK.
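To check which source address and port the server actually sees for connections A, B and C (for example PublicIP:8661 versus PublicIP:8869), the server could log the peer of each accepted socket. This is a minimal sketch assuming IPv4 sockets; format_addr and log_peer are hypothetical helper names, not part of our code.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Format an IPv4 socket address as "a.b.c.d:port" into buf.
 * Returns 0 on success, -1 if conversion fails or buf is too small. */
int format_addr(const struct sockaddr_in *sa, char *buf, size_t len)
{
    char ip[INET_ADDRSTRLEN];
    int n;

    if (inet_ntop(AF_INET, &sa->sin_addr, ip, sizeof ip) == NULL)
        return -1;
    n = snprintf(buf, len, "%s:%u", ip, (unsigned)ntohs(sa->sin_port));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Log the remote endpoint of an accepted socket as seen by the server.
 * Assumes fd is an IPv4 TCP socket. Returns 0 on success, -1 on error. */
int log_peer(int fd, const char *label)
{
    struct sockaddr_in peer;
    socklen_t plen = sizeof peer;
    char text[32];

    if (getpeername(fd, (struct sockaddr *)&peer, &plen) == -1)
        return -1;
    if (format_addr(&peer, text, sizeof text) == -1)
        return -1;
    fprintf(stderr, "%s: peer %s\n", label, text);
    return 0;
}
```

Calling log_peer(fd, "accepted") right after accept() would record the address and port that each connection arrives from, which makes it easier to compare the dead connection with the new one.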