2 votes

I have a virtual machine running Windows Server 2012 R2 in the Azure cloud. This machine has its private and public IP addresses statically assigned. On that machine, I'm running a client application (a Jenkins agent, to be specific). This client opens a TCP connection to its server (the Jenkins master), which is running outside of the Azure cloud (behind some public IP address). The TCP connection is established fine.

In order to keep this connection alive, the client and the server "ping" each other every 4-5 minutes. This "pinging" is done by exchanging several TCP packets over that open TCP connection.
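
For illustration, here is a minimal sketch of the kind of application-level ping loop involved, assuming a plain java.net.Socket (the real Jenkins agent/master channel handles its own pinging internally; the host name and port below are placeholders):

```java
import java.io.OutputStream;
import java.net.Socket;

public class KeepAlivePinger {
    public static void main(String[] args) throws Exception {
        // Placeholder master address and port; not the actual Jenkins remoting protocol.
        try (Socket socket = new Socket("jenkins-master.example.com", 50000)) {
            socket.setKeepAlive(true);        // also enable OS-level TCP keepalive probes
            OutputStream out = socket.getOutputStream();
            while (true) {
                out.write(0);                 // a single "ping" byte over the open connection
                out.flush();
                Thread.sleep(4 * 60 * 1000);  // every ~4 minutes, as described above
            }
        }
    }
}
```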

After some random time interval, the client can't reach the server anymore and the server can't reach the client anymore. Connection timeout exceptions are therefore thrown on both the client and the server end.

To analyze the issue, I tracked this communication with Wireshark running on the Windows Server machine in Azure (where the client application runs). While the communication works well, Wireshark shows TCP traffic being exchanged between:

- the client's private IP address / local port
- the server's public IP address / port

This seems perfectly logical, because the Azure machine (the client) is not aware of its public IP address and publicly visible port (after NAT is applied).

When the issue starts occurring, I see that both the client and the server send TCP retransmission packets, which means that neither of them received a TCP ACK for some previously sent TCP PSH packet. Strangest of all, the client machine was receiving these TCP retransmissions from the server, but those packets are not addressed to the client's private IP / local port. Wireshark shows them as being sent to the client's public IP and publicly visible port! Obviously the client application doesn't receive these packets, because the machine's NIC/driver discards them (which is also expected).

QUESTION: Does anyone have any idea why TCP responses sent to the Azure machine's (client's) public IP address and publicly visible port sometimes reach the machine itself without NAT translation being applied to them?

The problem seems to be worked around by shortening the ping interval. The client and server now exchange data every 30 seconds, and the connection has looked healthy for the last 2 hours. If this holds, I'd feel free to conclude that TCP session inactivity time was triggering the issue. – Marko Andrijevic

1 Answer

0 votes

After 3 days of tracking the status, no re-occurrences of the issue have been noticed! So I'm resolving this question with the conclusion: more frequent client/server pinging (i.e. keeping the connection alive) definitely works around this Azure problem.
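
For anyone hitting the same thing: the behaviour is consistent with an idle NAT/SNAT flow being dropped by Azure (the default idle timeout for such flows is reportedly around 4 minutes), so keeping traffic on the connection more frequent than that avoids the problem. As a hedged sketch (not the actual Jenkins configuration), on JDK 11+ the OS-level TCP keepalive timers can also be tightened per socket, where the JDK build and operating system support the extended options:

```java
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

public class ShortKeepAlive {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; the point is only the keepalive tuning below.
        try (Socket socket = new Socket("jenkins-master.example.com", 50000)) {
            socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
            // JDK 11+ extended options (availability depends on JDK build and OS):
            // start probing after 30 s of idle time and probe every 30 s,
            // well below the suspected NAT idle timeout.
            socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 30);
            socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 30);
            // ... hand the socket over to the application protocol here ...
        }
    }
}
```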