I'm running into a deadlock situation when calling StackExchange.Redis.
I don't know exactly what is going on, which is very frustrating, and I would appreciate any input that could help resolve or workaround this problem.
In case you have this problem too and don't want to read all this; I suggest that you'll try setting
PreserveAsyncOrder
tofalse
.ConnectionMultiplexer connection = ...; connection.PreserveAsyncOrder = false;
Doing so will probably resolve the kind of deadlock that this Q&A is about and could also improve performance.
Our setup
- The code is run as either a Console application or as an Azure Worker Role.
- It exposes a REST api using HttpMessageHandler so the entry point is async.
- Some parts of the code have thread affinity (is owned by, and must be run by, a single thread).
- Some parts of the code is async-only.
- We are doing the sync-over-async and async-over-sync anti-patterns. (mixing
await
andWait()
/Result
). - We're only using async methods when accessing Redis.
- We're using StackExchange.Redis 1.0.450 for .NET 4.5.
Deadlock
When the application/service is started it runs normally for a while then all of a sudden (almost) all incoming requests stop functioning, they never produce a response. All those requests are deadlocked waiting for a call to Redis to complete.
Interestingly, once the deadlock occur, any call to Redis will hang but only if those calls are made from an incoming API request, which are run on the thread pool.
We are also making calls to Redis from low priority background threads, and these calls continue to function even after the deadlock occurred.
It seems as if a deadlock will only occur when calling into Redis on a thread pool thread. I no longer think this is due to the fact that those calls are made on a thread pool thread. Rather, it seems like any async Redis call without continuation, or with a sync safe continuation, will continue to work even after the deadlock situation has occurred. (See What I think happens below)
Related
StackExchange.Redis Deadlocking
Deadlock caused by mixing
await
andTask.Result
(sync-over-async, like we do). But our code is run without synchronization context so that doesn't apply here, right?How to safely mix sync and async code?
Yes, we shouldn't be doing that. But we do, and we'll have to continue doing so for a while. Lots of code that needs to be migrated into the async world.
Again, we don't have a synchronization context, so this should not be causing deadlocks, right?
Setting
ConfigureAwait(false)
before anyawait
has no effect on this.Timeout exception after async commands and Task.WhenAny awaits in StackExchange.Redis
This is the thread hijacking problem. What's the current situation on this? Could this be the problem here?
StackExchange.Redis async call hangs
From Marc's answer:
...mixing Wait and await is not a good idea. In addition to deadlocks, this is "sync over async" - an anti-pattern.
But he also says:
SE.Redis bypasses sync-context internally (normal for library code), so it shouldn't have the deadlock
So, from my understanding StackExchange.Redis should be agnostic to whether we're using the sync-over-async anti-pattern. It's just not recommended as it could be the cause of deadlocks in other code.
In this case, however, as far as I can tell, the deadlock is really inside StackExchange.Redis. Please correct me if I'm wrong.
Debug findings
I've found that the deadlock seems to have its source in ProcessAsyncCompletionQueue
on line 124 of CompletionManager.cs
.
Snippet of that code:
while (Interlocked.CompareExchange(ref activeAsyncWorkerThread, currentThread, 0) != 0)
{
// if we don't win the lock, check whether there is still work; if there is we
// need to retry to prevent a nasty race condition
lock(asyncCompletionQueue)
{
if (asyncCompletionQueue.Count == 0) return; // another thread drained it; can exit
}
Thread.Sleep(1);
}
I've found that during the deadlock; activeAsyncWorkerThread
is one of our threads that is waiting for a Redis call to complete. (our thread = a thread pool thread running our code). So the loop above is deemed to continue forever.
Without knowing the details, this sure feels wrong; StackExchange.Redis is waiting for a thread that it thinks is the active async worker thread while it is in fact a thread that is quite the opposite of that.
I wonder if this is due to the thread hijacking problem (which I don't fully understand)?
What to do?
The main two question I'm trying to figure out:
Could mixing
await
andWait()
/Result
be the cause of deadlocks even when running without synchronization context?Are we running into a bug/limitation in StackExchange.Redis?
A possible fix?
From my debug findings it seems as the problem is that:
next.TryComplete(true);
...on line 162 in CompletionManager.cs
could under some circumstances let the current thread (which is the active async worker thread) wander off and start processing other code, possibly causing a deadlock.
Without knowing the details and just thinking about this "fact", then it would seem logical to temporarily release the active async worker thread during the TryComplete
invocation.
I guess that something like this could work:
// release the "active thread lock" while invoking the completion action
Interlocked.CompareExchange(ref activeAsyncWorkerThread, 0, currentThread);
try
{
next.TryComplete(true);
Interlocked.Increment(ref completedAsync);
}
finally
{
// try to re-take the "active thread lock" again
if (Interlocked.CompareExchange(ref activeAsyncWorkerThread, currentThread, 0) != 0)
{
break; // someone else took over
}
}
I guess my best hope is that Marc Gravell would read this and provide some feedback :-)
No synchronization context = The default synchronization context
I've written above that our code does not use a synchronization context. This is only partially true: The code is run as either a Console application or as an Azure Worker Role. In these environments SynchronizationContext.Current
is null
, which is why I wrote that we're running without synchronization context.
However, after reading It's All About the SynchronizationContext I've learned that this is not really the case:
By convention, if a thread’s current SynchronizationContext is null, then it implicitly has a default SynchronizationContext.
The default synchronization context should not be the cause of deadlocks though, as UI-based (WinForms, WPF) synchronization context could - because it does not imply thread affinity.
What I think happens
When a message is completed its completion source is checked for whether it is considered sync safe. If it is, the completion action is executed inline and everything is fine.
If it is not, the idea is to execute the completion action on a newly allocated thread pool thread. This too works just fine when ConnectionMultiplexer.PreserveAsyncOrder
is false
.
However, when ConnectionMultiplexer.PreserveAsyncOrder
is true
(the default value), then those thread pool threads will serialize their work using a completion queue and by ensuring that at most one of them is the active async worker thread at any time.
When a thread becomes the active async worker thread it will continue to be that until it have drained the completion queue.
The problem is that the completion action is not sync safe (from above), still it is executed on a thread that must not be blocked as that will prevent other non sync safe messages from being completed.
Notice that other messages that are being completed with a completion action that is sync safe will continue to work just fine, even though the active async worker thread is blocked.
My suggested "fix" (above) would not cause a deadlock in this way, it would however mess with the notion of preserving async completion order.
So maybe the conclusion to make here is that it is not safe to mix await
with Result
/Wait()
when PreserveAsyncOrder
is true
, no matter whether we are running without synchronization context?
(At least until we can use .NET 4.6 and the new TaskCreationOptions.RunContinuationsAsynchronously
, I suppose)