We've got several web and worker roles in Azure connecting to our Azure Redis cache via the StackExchange.Redis library, and we're receiving regular timeouts that are making our end-to-end solution grind to a halt. An example of one of them is below:
System.TimeoutException: Timeout performing GET stream:459, inst: 4, mgr: Inactive, queue: 12, qu=0, qs=12, qc=0, wr=0/0, in=65536/0
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1785
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 79
   at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 1346
   at OptiRTC.Cache.RedisCacheActions.<>c__DisplayClass41.<Get>b__3() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 104
   at Polly.Retry.RetryPolicy.Implementation(Action action, IEnumerable`1 shouldRetryPredicates, Func`1 policyStateFactory)
   at OptiRTC.Cache.RedisCacheActions.Get[T](String key, Boolean allowDirtyRead) in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 107
   at OptiRTC.Cache.RedisCacheAccess.d__e4.MoveNext() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheAccess.cs:line 1196; TraceSource 'WaWorkerHost.exe' event
All the timeouts show different queue and qs numbers, but the rest of the message is consistent from one occurrence to the next, and the StringGet calls span many different keys in the cache. In each of our services, we use a singleton cache-access class with a single ConnectionMultiplexer that is registered with our IoC container during web or worker role startup:
container.RegisterInstance<ICacheAccess>(cacheAccess);
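Everything in a role instance resolves that same registration, so all cache traffic funnels through one multiplexer. A typical call site looks roughly like this (the Get signature matches the frame in the stack trace above; the container call is Unity-style and the cached type name is illustrative):

ICacheAccess cache = container.Resolve<ICacheAccess>(); // same singleton everywhere in the role
var state = cache.Get<StreamState>("stream:459", allowDirtyRead: false); // wraps db.StringGet internally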
In our implementation of ICacheAccess, we're creating the multiplexer as follows:
ConfigurationOptions options = new ConfigurationOptions();
options.EndPoints.Add(serverAddress);
options.Ssl = true;
options.Password = accessKey;
options.ConnectTimeout = 1000; // milliseconds
options.SyncTimeout = 2500;    // milliseconds; synchronous calls like StringGet time out after 2.5 s
redis = ConnectionMultiplexer.Connect(options);
where the redis object is used throughout the instance. We've got about 20 web and worker role instances connecting to the cache via this ICacheAccess implementation, but the management console shows an average of 200 concurrent connections to the cache - roughly 10 per instance, even though each instance holds a single multiplexer.
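Concretely, the reads that time out boil down to the following (a simplified sketch mirroring the StringGet frame in the stack trace; serialization and the retry wrapper are omitted here):

// Inside our cache-access code: a synchronous GET against the shared multiplexer.
IDatabase db = redis.GetDatabase();
RedisValue raw = db.StringGet(key); // subject to the 2500 ms SyncTimeout above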
I've seen other posts that reference using version 1.0.333 of StackExchange.Redis, which is the version we install via NuGet, but when I look at the StackExchange.Redis.dll reference that actually gets added, it shows assembly version 1.0.316.0. We've tried removing and re-adding the NuGet package, as well as adding it to a brand-new project, and we always get the same version discrepancy.
Any insight would be appreciated. Thanks.
Additional information:
We've upgraded to 1.0.371. We have two services that access the same cache object at different intervals: one edits it and occasionally reads it, and the other reads it several times a second. Both services are deployed with the same caching code and the same StackExchange.Redis version. I almost never see timeouts in the editing service, but I get timeouts on 50-75% of reads in the reading service. The timeouts have the same format as the one shown above, and they continue to occur even after wrapping the db.StringGet call in a Polly retry block that handles both RedisException and System.TimeoutException and retries once after 500 ms (sketched below).
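For reference, the retry wrapper is essentially the following sketch (the policy matches what's described above; variable names are illustrative):

using System;
using Polly;
using StackExchange.Redis;

// Retry once after 500 ms on the two exception types we see, then rethrow.
var retryPolicy = Policy
    .Handle<RedisException>()
    .Or<TimeoutException>()
    .WaitAndRetry(new[] { TimeSpan.FromMilliseconds(500) });

// db is the IDatabase obtained from the shared ConnectionMultiplexer.
RedisValue value = retryPolicy.Execute(() => db.StringGet(key));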
We contacted Microsoft about this issue, and they confirmed that they see nothing in the Redis logs indicating a problem on the Redis service side. Our cache miss rate on the Redis server is extremely low, yet we continue to get these timeouts, which substantially hinder our application's functionality.
In response to the comments: yes, there is always a number in qs and never in qc, and always a number in the first half of the in field (e.g. in=65536/0) and never in the second.
Even more additional information:
When I run a service with fewer instances at higher CPU utilization, I get significantly more of these timeout errors than when more instances run at lower CPU. More specifically, I pulled some numbers from our services this morning. When they were running at around 30% CPU, I saw very few timeouts: just 42 over 30 minutes. When I removed half the instances and the remainder started running at around 60-65% CPU, the rate increased more than tenfold, to 536 over 30 minutes.