
I am running a Service Fabric application on a cluster in Azure. The cluster has two scale sets:

  • 4x B2ms nodes where a stateful service type is placed with Placement Constraints (Primary Scale Set)
  • 2x F1 nodes where a stateless service type is placed.

There are two types of services in the application:

  • WebAPI - a stateless service that receives statuses from a system via HTTP and sends them to the StatusConsumer.
  • StatusConsumer - a stateful service that processes the statuses and keeps the last one. A service instance is created for each system. It communicates via RemotingV2.

For my tests I am using Application Insights and Service Fabric Analytics to track performance. I am observing the following parameters:

Metrics for the stateful scale set: Percentage CPU, Disk Read/Write operations/sec

Application Insights: Server Response Time - which corresponds to the execution time of the method that receives the statuses in the stateful StatusConsumer.

Service Fabric Analytics: Performance Counters with log analytics agent on the stateful nodes - used to observe the RAM usage of the nodes.

Each simulated system sends its status every 30 seconds.

At the beginning of the test the RAM usage is around 40% on each node, the average Server Response Time is around 15 ms, CPU usage is around 10%, and Read/Write operations are under 100/s.

Immediately after the start of the test the RAM usage starts slowly building up, but there is no difference in the other observed metrics.

After about an hour of simulating 1000 systems the RAM usage is around 90-95% and problems start to show in the other metrics - Avg Server response time peaks with values around 5-10 seconds and Disk Read/Write operations reach around 500/sec.

This continues for 1-3 minutes, then the RAM usage drops and everything goes back to normal.

[Images: RAM Usage, Server Response Time]

On the images you can see that the RAM peak corresponds to the server response time peak. At the end of the RAM graph the usage is flat, to show the behaviour without simulating any systems.

The number of systems simulated only changes how long it takes for the RAM to reach critical levels - in one test with 200 simulated systems, the RAM usage rose more slowly.

As for the code: in the first tests the code was more complicated, but to find the cause of the problem I started removing functionality, and still there was no improvement. The only time the RAM usage didn't rise was when I commented out all the code, leaving only a try/catch block and a return in the body of the method that receives the status. Currently the code in the StatusConsumer is this:

public async Task PostStatus(SystemStatusInfo status)
{
    try
    {
        Stopwatch stopWatch = new Stopwatch();
        IReliableDictionary<string, SystemStatusInfo> statusDictionary =
            await this.StateManager.GetOrAddAsync<IReliableDictionary<string, SystemStatusInfo>>("status");

        stopWatch.Start();

        using (ITransaction tx = this.StateManager.CreateTransaction())
        {
            await statusDictionary.AddOrUpdateAsync(tx, "lastConsumedStatus", key => status, (key, oldValue) => status);
            await tx.CommitAsync();
        }

        stopWatch.Stop();

        // Elapsed.TotalSeconds avoids the integer truncation of ElapsedMilliseconds / 1000.
        if (stopWatch.Elapsed.TotalSeconds > 4)
        {
            Telemetry.TrackTrace($"Queue Status Duration: {stopWatch.Elapsed.TotalSeconds:F1}s for {status.SystemId}", SeverityLevel.Critical);
        }
    }
    catch (Exception e)
    {
        Telemetry.TrackException(e);
    }
}

How can I diagnose and/or fix this?

PS: After connecting to the nodes with Remote Desktop, in Task Manager I can see that when the RAM usage is around 85%, the memory of the SystemStatusConsumer process, which hosts the microservice's instances, is no more than 600 MB. That is the highest consumption of any process, but it is still not that high - the node has 8 GB of RAM. However, I don't know whether this information is useful in this case.

1 Answer


After talking to Azure Support and running multiple tests, I drastically reduced the memory consumption of the services.

The main thing I learned from the communication with the support was that it is really not a good idea to have a large number of services each holding a small amount of data! Memory dumps of the application showed that each service had roughly 20 KB of actual data and about 700 KB of Reliable Collections change logs accumulated by Service Fabric. These may not be exact numbers, but the difference was huge.

To reduce the number of services I combined the processing and saving of multiple systems' statuses into one service, using a kind of partitioning. I also tried using Actors. Both approaches worked well.
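As a rough sketch of the consolidated design (the member names and the SetAsync-based body are illustrative assumptions, not the exact production code): many systems share one partition, and each system's last status is stored under its SystemId in a single reliable dictionary, rather than in its own service instance.

```csharp
// Illustrative sketch only: one stateful service partition handles many systems,
// instead of one service instance per system.
public async Task PostStatus(SystemStatusInfo status)
{
    // One dictionary shared by every system routed to this partition.
    IReliableDictionary<string, SystemStatusInfo> statuses =
        await this.StateManager.GetOrAddAsync<IReliableDictionary<string, SystemStatusInfo>>("statuses");

    using (ITransaction tx = this.StateManager.CreateTransaction())
    {
        // Keyed by SystemId, so only the last status per system is kept.
        await statuses.SetAsync(tx, status.SystemId, status);
        await tx.CommitAsync();
    }
}
```

The WebAPI then routes each status to a partition, for example via a ranged (Int64) partition key derived from a hash of SystemId, so the per-service replicator overhead is paid once per partition instead of once per system.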

There are several other settings I used to reduce the memory consumption, but the big difference was made by changing the architecture of the services themselves:

In the Settings.xml of the service itself:

  • CheckpointThresholdInMB = 1

  • LogTruncationIntervalSeconds = 1200 (setting this value to less than 120 actually didn't do anything or made things worse; try values larger than 300)

  • MaxAccumulatedBackupLogSizeInMB = 1
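For reference, here is how those three parameters would look in the service's Settings.xml, assuming the default "ReplicatorConfig" section name read by the reliable state manager:

```xml
<?xml version="1.0" encoding="utf-8"?>
<Settings xmlns:xsd="http://www.w3.org/2001/XMLSchema"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <!-- Replicator settings for the stateful service. -->
  <Section Name="ReplicatorConfig">
    <Parameter Name="CheckpointThresholdInMB" Value="1" />
    <Parameter Name="LogTruncationIntervalSeconds" Value="1200" />
    <Parameter Name="MaxAccumulatedBackupLogSizeInMB" Value="1" />
  </Section>
</Settings>
```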

In the code of the service itself:

  • ServicePointManager.DefaultConnectionLimit = 200

  • MaxConcurrentCalls = 512 (RemotingListener and Client)
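A sketch of where those two code-side settings go (the listener wiring below is the standard V2 remoting pattern; adapt it to your own service):

```csharp
// In the service host startup (e.g. Program.Main), before any outbound HTTP calls:
ServicePointManager.DefaultConnectionLimit = 200;

// In the stateful service class, raise the remoting listener's concurrency:
protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
{
    var listenerSettings = new FabricTransportRemotingListenerSettings
    {
        MaxConcurrentCalls = 512
    };

    return new[]
    {
        new ServiceReplicaListener(context =>
            new FabricTransportServiceRemotingListener(context, this, listenerSettings))
    };
}
```

On the calling side, the matching knob is MaxConcurrentCalls on FabricTransportRemotingSettings, passed to the remoting client factory.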

Cluster Settings:

  • AutomaticMemoryConfiguration = 0 (if you do not set this, the other two settings won't take effect)
  • WriteBufferMemoryPoolMinimumInKB = 16384 (16 MB)
  • WriteBufferMemoryPoolMaximumInKB = 32768 (32 MB)
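On an Azure cluster these go into the fabricSettings array of the ARM template, under the KtlLogger section; note the parameters are expressed in KB, so 16 MB and 32 MB become 16384 and 32768:

```json
"fabricSettings": [
  {
    "name": "KtlLogger",
    "parameters": [
      { "name": "AutomaticMemoryConfiguration", "value": "0" },
      { "name": "WriteBufferMemoryPoolMinimumInKB", "value": "16384" },
      { "name": "WriteBufferMemoryPoolMaximumInKB", "value": "32768" }
    ]
  }
]
```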