
We are facing a strange problem: closing a dead TCP socket (the connection died because the cable was unplugged) affects another, healthy TCP connection. Details below:

  1. Topology
    Client A ←→ Switch A ← Router A:NAT ← .. Network .. → Router B:NAT → Switch B ←→ Server B

  2. Problem:
    Suppose there is a dead connection between the client and the server, caused by unplugging the cable between the client machine and the switch. After unplugging the cable, we log in as client A from another machine, which creates a new TCP connection between the client and the server. This new connection works.

    We find that if the server closes the dead TCP connection while the kernel is still retransmitting data on it, the other TCP connection appears to be polluted: the client-to-server direction stops working, so data sent by the client over that connection is never received by the server. What surprised us is that the other direction, from the server to the client, keeps working: data sent by the server over the same TCP socket still reaches the client machine.

    But if we wait until retransmission on the dead connection has stopped (e.g. 2 hours) and then close the socket, the other TCP connection remains OK.

Here are the detailed steps for this issue:
1. There are two clients, both behind Router A (NAT); the NAT is full-cone.
2. There is a Linux server behind Router B (NAT); the NAT is full-cone, but here it uses port forwarding.
3. There are four machines in total. Call the two clients X and Y and the server S.
4. X and Y log in and set up a video meeting. Each creates a TCP connection to the server; call them channel CX and channel CY.
5. Unplug the cable of the machine on which client Y is running. Channel CY is now broken and dead, but channel CX remains OK.
6. Log in as Y from the fourth machine and set up a video meeting with X again. This creates a new TCP channel; call it CY2.

Result:
In step 6, if the server closes the dead connection CY within minutes, the new channel CY2 becomes unidirectional: data sent from client Y, including ACK packets, cannot reach the server, while the reverse direction still works.

If the server closes the dead connection CY after a long time, such as 2 hours, no problem occurs.

This problem only happens when running through NATs; at least, we have never reproduced it when running the applications within the same LAN (where no NAT traversal is needed).

Does anybody know why this happens?

Edit:
On the server side, we are using non-blocking TCP sockets and the select model.

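     pseudocode:
     // server: listen on port 8013 on all local interfaces
     listenfd = socket(AF_INET, SOCK_STREAM, 0);
     localAddr.port = htons(8013);
     localAddr.ip = htonl(INADDR_ANY);
     bind(localAddr...)
     listen(listenfd, 100);

     ...
     // using the select model; the first argument is the highest fd plus one,
     // and a NULL timeout blocks until a descriptor is ready
     select(maxFd + 1, &fdSet, NULL, NULL, NULL);
     for(...)
     {
         if (FD_ISSET(listenfd, &fdSet))
         {
             // new incoming connection: accept it and make it non-blocking
             fd = accept(...)
             set_non_block(fd);
             ...
         }
         ...
     }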
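
For readers who want to reproduce the setup: the pseudocode calls a helper named set_non_block() whose body is not shown. A minimal sketch of what such a helper might look like on Linux/POSIX follows. The implementation is an assumption, not the asker's actual code; only the name comes from the question.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical body for the set_non_block() helper named in the pseudocode.
 * Returns 0 on success, -1 on failure (errno is set by fcntl). */
int set_non_block(int fd)
{
    assert(fd >= 0);                    /* caller must pass a valid descriptor */

    int flags = fcntl(fd, F_GETFL, 0);  /* read the current file status flags */
    if (flags == -1)
        return -1;

    /* Keep the existing flags and add O_NONBLOCK. */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}
```

With this in place, read and write calls on the accepted socket return immediately with EAGAIN or EWOULDBLOCK instead of blocking, which is what the select loop relies on.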

More Information:
1) connection A on First machine: 192.168.10.4:13000 ←→ ... ← Router A:NAT ← -Now: from PublicIP:8661 (random)..Network .. → Router B:NAT (to port:8013, Port Forwarding) → ... ←→ Server B

2) connection B on Second machine: 192.168.10.7:13000 ←→ ... ← Router A:NAT ← -Now: from PublicIP:8777 (random)..Network .. → Router B:NAT (to port:8013, Port Forwarding) → ... ←→ Server B

3) Unplug the wire so that connection A is dead, then create a new connection C on a third machine: 192.168.10.10:13000 ←→ ... ← Router A:NAT ← -Now: from PublicIP:8869 (random).. Network .. → Router B:NAT (to port:8013, Port Forwarding) → ... ←→ Server B

If we close connection A from the server, connection C becomes unidirectional, but if we wait 2 hours before closing connection A from the server, connection C remains OK.
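
To check which source address and port the server actually sees for each connection after both NATs (the 8661, 8777 and 8869 ports above are what Router A chose, not necessarily what arrives at Server B), it may help to log the peer endpoint of every accepted socket. The following is a minimal sketch assuming IPv4 only; the function names format_endpoint and describe_peer are hypothetical and not part of the original server.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Format an IPv4 endpoint as "a.b.c.d:port".
 * Returns 0 on success, -1 on failure or if buf is too small. */
int format_endpoint(const struct sockaddr_in *sa, char *buf, size_t len)
{
    char ip[INET_ADDRSTRLEN];

    assert(sa != NULL && buf != NULL);
    if (inet_ntop(AF_INET, &sa->sin_addr, ip, sizeof(ip)) == NULL)
        return -1;

    int n = snprintf(buf, len, "%s:%u", ip, (unsigned)ntohs(sa->sin_port));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write the peer endpoint of a connected socket into buf, i.e. the
 * address the server sees after NAT translation. Returns 0 or -1. */
int describe_peer(int fd, char *buf, size_t len)
{
    struct sockaddr_in peer;
    socklen_t plen = sizeof(peer);

    memset(&peer, 0, sizeof(peer));
    if (getpeername(fd, (struct sockaddr *)&peer, &plen) == -1)
        return -1;
    return format_endpoint(&peer, buf, len);
}
```

Calling describe_peer() right after accept() and writing the result to the server log would show whether the dead connection and the new one arrive with the same or different source endpoints, which can then be matched against a packet capture.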

If there is only one route between client A and server B, how is the server establishing a connection with the client when the wire is unplugged? - jxh
@user315052 There are many machines connected to the switch in the LAN; all these machines share one public IP. - Wallace
So you are saying you unplugged a cable between the switch and the router, but there may be more than one cable between the switch and router? I am trying to imagine how I would reproduce your problem, but I can't seem to figure it out. - jxh
Something wrong with your server code? That seems a logical conclusion to me... - Karoly Horvath
@StevePeng - You'll find that having detailed log files that report each socket operation, matched against a corresponding Wireshark trace, will help in revealing the bug in your code. It's not fun, but since you have a repro case, it shouldn't be hard to track down. - selbie

1 Answer


Wow, what a conundrum. I think I have a possible answer, though. I don't really like the implications, but I guess they are inevitable when looking at the standard (here is a Wikipedia simplification).

NAT (and especially full-cone NAT) works by giving a client an internal address (IP and port) to match the external address it is trying to reach. Any return traffic is sent to the internal address and then forwarded to the external address by the router.

Let's use an example to expand on this brief explanation and show what it means for you...

Suppose you have a NAT gateway, forwarding port 80 to an internal server, the internal destination is also port 80. The gateway has external IP n.n.n.n and internal IP y.y.y.y.

When a client connects to n.n.n.n:80, the NAT gateway faithfully forwards the request to y.y.y.y:80, but in the process it rewrites the IP packet. The sender address is now the NAT gateway's internal IP, and the sender port is no longer the one the client used but a new one assigned by the NAT gateway.

The new port is assigned by the NAT gateway, yes. But it is assigned as a function of the client IP and the port it tried to access, in this case 80.

All well and good, but... When a client establishes its second connection, the same mapping function is used. Should this pose a problem? Well, it can. If the gateway does not distinguish between different client addresses (ideally, each connection from a client should have a unique port), it will simply overwrite the mapping the old connection made.

This causes retransmitted traffic from the old socket to be sent to the client's new socket.

Highly undesirable, but possible depending on how the NAT is implemented. And since it seems to be a NAT problem, it will not show up when the machines are directly connected...
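
To make the suspected overwrite concrete, here is a toy model of a translation table. It is a sketch of the hypothesis only, not of how any particular router works: the struct layout, the names nat_table and nat_add, and the key_includes_client_port switch are all made up for illustration. With the coarse key (client IP and destination port only), a second connection from the same public IP replaces the first entry; with the finer key, both entries coexist.

```c
#include <assert.h>
#include <stdint.h>

#define NAT_MAX 16

/* One translation entry (illustrative layout, not a real router's). */
struct nat_entry {
    uint32_t client_ip;    /* public IP the connection arrives from */
    uint16_t client_port;  /* source port chosen by the client-side NAT */
    uint16_t dst_port;     /* forwarded port, e.g. 8013 in the question */
};

struct nat_table {
    struct nat_entry entries[NAT_MAX];
    int count;
    int key_includes_client_port;  /* 0 = coarse key (the suspected behaviour) */
};

/* Two entries share a key if client IP and destination port match and,
 * when the table uses the finer key, the client port matches as well. */
static int same_key(const struct nat_table *t,
                    const struct nat_entry *a, const struct nat_entry *b)
{
    if (a->client_ip != b->client_ip || a->dst_port != b->dst_port)
        return 0;
    return t->key_includes_client_port ? a->client_port == b->client_port : 1;
}

/* Add a mapping. Returns 1 if an existing entry was overwritten, 0 if new. */
int nat_add(struct nat_table *t, struct nat_entry e)
{
    for (int i = 0; i < t->count; i++) {
        if (same_key(t, &t->entries[i], &e)) {
            t->entries[i] = e;  /* the old connection's mapping is lost */
            return 1;
        }
    }
    assert(t->count < NAT_MAX);  /* toy table has a fixed capacity */
    t->entries[t->count++] = e;
    return 0;
}
```

Feeding this table the question's connection A (PublicIP:8661) and then connection C (PublicIP:8869), both to port 8013, leaves a single entry for C under the coarse key, which is the situation the explanation above assumes.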

Now, I already see a hole in my explanation: this would mean that you could not have two sockets open to the same server simultaneously, because any return traffic would be garbled. The only reason I can think of that this works is that the socket is still open, so the gateway does not regard it as dead and therefore creates a second mapping for that client.

Hope I made at least some sense.