5
votes

We have a class that broadcasts messages to a Service Fabric stateless service. This stateless service has a single partition, but many replicas. The message should be sent to all the replicas in the system. Therefore we query the FabricClient for the single partition, and for all the replicas of that partition. We use standard HTTP communication (the stateless service has a Communication Listener with a self-hosted OWIN listener, using WebListener/HttpSys) with a shared HttpClient instance. During a load test, we get many errors while sending messages. Note that we have other services in the same application that also communicate (WebListener/HttpSys, ServiceProxy, and ActorProxy).

The code where we see exceptions is (the stack trace is below the code sample):

private async Task SendMessageToReplicas(string actionName, string message)
{
  var fabricClient = new FabricClient();
  var eventNotificationHandlerServiceUri = new Uri(ServiceFabricSettings.EventNotificationHandlerServiceName);

  var promises = new List<Task>();
  // There is only one partition of this service, but there are many replicas
  Partition partition = (await fabricClient.QueryManager.GetPartitionListAsync(eventNotificationHandlerServiceUri).ConfigureAwait(false)).First();

  string continuationToken = null;
  do
  {
    var replicas = await fabricClient.QueryManager.GetReplicaListAsync(partition.PartitionInformation.Id, continuationToken).ConfigureAwait(false);
    foreach(Replica replica in replicas)
    {
      promises.Add(SendMessageToReplica(replica, actionName, message));
    }

    continuationToken = replicas.ContinuationToken;
  } while(continuationToken != null);

  await Task.WhenAll(promises).ConfigureAwait(false);
}


private async Task SendMessageToReplica(Replica replica, string actionName, string message)
{
  if(replica.TryGetEndpoint(out Uri replicaUrl))
  {
    Uri requestUri = UriUtility.Combine(replicaUrl, actionName);
    using(var response = await _httpClient.PostAsync(requestUri, message == null ? null : new JsonContent(message)).ConfigureAwait(false))
    {
      string responseContent = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
      if(!response.IsSuccessStatusCode)
      {
        throw new Exception();
      }
    }
  }
  else
  {
    throw new Exception();
  }
}

The following Exception is thrown:

System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071C49
at System.Fabric.Interop.NativeClient.IFabricQueryClient9.EndGetPartitionList2(IFabricAsyncOperationContext context)
at System.Fabric.FabricClient.QueryClient.GetPartitionListAsyncEndWrapper(IFabricAsyncOperationContext context)
at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
--- End of inner exception stack trace ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Company.ServiceFabric.ServiceFabricEventNotifier.<SendMessageToReplicas>d__7.MoveNext() in c:\work\ServiceFabricEventNotifier.cs:line 138

During the same period we also see this Exception being thrown:

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.) ---> System.ComponentModel.Win32Exception (0x80004005): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken)

The event logs on the machines in the cluster show these warnings:

Event ID: 4231
Source: Tcpip
Level: Warning
A request to allocate an ephemeral port number from the global TCP port space has failed due to all such ports being in use.

Event ID: 4227
Source: Tcpip
Level: Warning
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.

And finally, the Microsoft-Service-Fabric admin log shows hundreds of warnings similar to:

Event 4121
Source Microsoft-Service-Fabric
Level: Warning
client-02VM4.company.nl:19000/192.168.10.36:19000: error = 2147942452, failureCount=160522. Filter by (type~Transport.St && ~"(?i)02VM4.company.nl:19000") to get listener lifecycle. Connect failure is expected if listener was never started, or listener/its process was stopped before/during connecting.

Event 4097
Source Microsoft-Service-Fabric
Level: Warning
client-02VM4.company.nl:19000 : connect failed, having tried all addresses

After a while, the warnings become errors:

Event 4096
Source Microsoft-Service-Fabric
Level: Error
client-02VM4.company.nl:19000 failed to bind to local port for connecting: 0x80072747

Can anyone tell us why this happens, and what we can do to solve it? Are we doing something wrong?


3 Answers

4
votes

We (I work with the OP) have been testing this and it turned out to be the FabricClient as suggested by Esben Bach.

The documentation on the FabricClient also states:

It is highly recommended that you share FabricClients as much as possible. This is because the FabricClient has multiple optimizations such as caching and batching that you would not be able to fully utilize otherwise.

It seems the FabricClient behaves like the HttpClient class: you should share a single instance, and if you don't, you run into the same problem, port exhaustion.

However, the documentation on common exceptions when working with the FabricClient also mentions that when a FabricObjectClosedException occurs you should:

Dispose of the FabricClient object you are using and instantiate a new FabricClient object.

Sharing the FabricClient fixes the port exhaustion problem.
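
As a minimal sketch of what sharing the FabricClient could look like, the replica query from the question can be rewritten around one static instance. The class layout, the `_serviceUri` field and the `GetReplicasAsync` helper are illustrative names, not the actual production code, and the sketch does not handle the FabricObjectClosedException case mentioned above (where the client would have to be disposed and recreated).

```csharp
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Query;
using System.Linq;
using System.Threading.Tasks;

public sealed class ServiceFabricEventNotifier
{
    // One FabricClient for the whole process instead of one per call.
    // Creating a new client on every broadcast opens new gateway
    // connections each time, which is what exhausted the ephemeral ports.
    private static readonly FabricClient SharedFabricClient = new FabricClient();

    // Hypothetical field: the URI of the service whose replicas we query.
    private readonly Uri _serviceUri;

    public ServiceFabricEventNotifier(Uri serviceUri)
    {
        _serviceUri = serviceUri;
    }

    // Hypothetical helper: returns all replicas of the single partition,
    // following continuation tokens the same way the question's code does.
    private async Task<List<Replica>> GetReplicasAsync()
    {
        var partitions = await SharedFabricClient.QueryManager
            .GetPartitionListAsync(_serviceUri)
            .ConfigureAwait(false);
        Partition partition = partitions.First();

        var result = new List<Replica>();
        string continuationToken = null;
        do
        {
            var page = await SharedFabricClient.QueryManager
                .GetReplicaListAsync(partition.PartitionInformation.Id, continuationToken)
                .ConfigureAwait(false);
            result.AddRange(page);
            continuationToken = page.ContinuationToken;
        } while (!string.IsNullOrEmpty(continuationToken));

        return result;
    }
}
```

The only change compared to the question's code is where the FabricClient comes from; the queries themselves are the same.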

1
votes

It would seem you have a port exhaustion problem. If that is the case, then either you have to figure out how to reuse your connections, or you will have to implement some sort of throttling mechanism so you don't use up all the available ports.
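
One possible shape for such a throttling mechanism is a SemaphoreSlim that caps how many sends run at the same time. This is only a sketch: the `Throttle` class, the `ForEachAsync` name and the `maxConcurrency` parameter are made up for illustration, and a cap like this only limits concurrent requests, so it does not help if each request still opens a fresh connection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Runs 'action' for every item, but never more than
    // 'maxConcurrency' invocations at the same time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> action)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                // Wait for a free slot before starting the work.
                await gate.WaitAsync().ConfigureAwait(false);
                try
                {
                    await action(item).ConfigureAwait(false);
                }
                finally
                {
                    // Always free the slot, even when the action throws.
                    gate.Release();
                }
            }).ToList();

            // WhenAll completes only after every task has finished,
            // so the semaphore is not disposed while still in use.
            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }
}
```

With the question's code, the call would look something like `await Throttle.ForEachAsync(replicas, 10, r => SendMessageToReplica(r, actionName, message));`, where 10 is an arbitrary limit that would need tuning.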

I'm not sure how the FabricClient behaves. It might be responsible for the exhaustion, or perhaps it's the SQL Server part that we cannot see the code for (but since you posted it in a log, I assume it's probably unrelated to your ping test).

Looking at the reference source for HttpWebResponse (https://github.com/Microsoft/referencesource/blob/master/System/net/System/Net/HttpWebResponse.cs), it might also be that disposing the response (i.e. your using statement around PostAsync) closes the HttpClient's connection, meaning you are not reusing the connection but opening new ones all the time.

I would guess that testing a variant that does not dispose the response is rather easy.

1
votes

What is the reason for calling each existing service instance?

Normally, you should call just one service instance provided by the SF runtime (it will try to choose one from the same node/process, or from another node if this node is too loaded).

If you need to signal some state change/event throughout all of your service instances, maybe this should be done inside the service implementation so that it checks for this state change (from a stateful service maybe) or from a pub-sub event queue each time it needs this information (see for example https://github.com/loekd/ServiceFabric.PubSubActors).

Another idea is to send many messages to a service instance at once, in another action that supports bulk data.

Keeping the connection open like in the previous answer is a good solution if you must send individual messages from a single source with a high frequency.

Also, the caller should implement connection resiliency, see for example https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-communication#communicating-with-a-service
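
To give a rough idea of what caller-side resiliency can mean, here is a generic retry-with-backoff helper. It is not the Service Fabric communication client from the linked documentation, just a plain .NET sketch; the `Retry` class, `WithBackoffAsync`, `maxAttempts` and `initialDelay` are names invented for this example, and retrying only on HttpRequestException is an assumption.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class Retry
{
    // Runs 'operation' and retries it when an HttpRequestException is
    // thrown, doubling the wait time after each failed attempt.
    // The last failure is rethrown once 'maxAttempts' is reached.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation, int maxAttempts, TimeSpan initialDelay)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Back off before trying again.
                await Task.Delay(delay).ConfigureAwait(false);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

In a real service the endpoint may have moved between attempts, so a retry would probably also need to resolve the replica address again, which this sketch leaves out.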