7
votes

I'm using StackExchange.Redis (SE.R henceforth) in my Nancy application. There is a single global ConnectionMultiplexer that gets automatically passed around by Nancy's TinyIoC via constructor parameters, and any time I try and use GetDatabase and one of the *Async methods (sync methods only start failing after one of the async methods have been attempted) my application deadlocks.

Looking at my parallel stacks it appears that I have four threads:

  1. The thread that called Result on one of my Tasks that uses SE.R. (Has lots of Nancy stuff on the stack, then a call to my library that utilizes SE.R, and a call to Result. Top of the stack is Monitor.Wait).
  2. A thread that spawned two other threads. I assume this is managed by SE.R. Starts with Native to Managed Transition, ThreadHelper.ThreadStart, and at the top of the stack is ThreadHelper.ThreadStart_Context.
  3. A small stack that is stuck like this:
    • Monitor.Wait
    • Monitor.Wait
    • SocketManager.WriteAllQueues
    • SocketManager.cctor.AnonymousMethod__16
  4. Another small stack that looks like this:
    • Managed to Native Transition
    • SocketManager.ReadImpl
    • SocketManager.Read
    • SocketManager.cctor.AnonymousMethod__19

I'm almost sure this is a deadlock of some sort. I even think it might have something to do with this question. But I have no idea what to do about it.

The ConnectionMultiplexer is set up in a Nancy IRegistrations with the following code:

var configOpts =  new ConfigurationOptions {
  EndPoints = {
    RedisHost,
  },
  Password = RedisPass,
  AllowAdmin = false,
  ClientName = ApplicationName,
  ConnectTimeout = 10000,
  SyncTimeout = 5000,
};
var mux = ConnectionMultiplexer.Connect(configOpts);
yield return new InstanceRegistration(typeof (ConnectionMultiplexer), mux);

mux is the instance that gets shared by all code requesting it in their constructor parameter list.

I have a class called SchemaCache. A small piece of it (that includes the code that throws the error in question) follows:

public SchemaCache(ConnectionMultiplexer connectionMultiplexer) {
    ConnectionMultiplexer = connectionMultiplexer;
}

private ConnectionMultiplexer ConnectionMultiplexer { get; set; }

private async Task<string[]> Cached(string key, bool forceFetch, Func<string[]> fetch) {
    var db = ConnectionMultiplexer.GetDatabase();

    return forceFetch || !await db.KeyExistsAsync(key)
        ? await CacheSetSet(db, key, await Task.Run(fetch))
        : await CacheGetSet(db, key);
}

private static async Task<string[]> CacheSetSet(IDatabaseAsync db, string key, string[] values) {
    await db.KeyDeleteAsync(key);
    await db.SetAddAsync(key, EmptyCacheSentinel);

    var keysSaved = values
        .Append(EmptyCacheSentinel)
        .Select(val => db.SetAddAsync(key, val))
        .ToArray()
        .Append(db.KeyExpireAsync(key, TimeSpan.FromDays(1)));
    await Task.WhenAll(keysSaved);

    return values;
}

private static async Task<string[]> CacheGetSet(IDatabaseAsync db, string key) {
    var results = await db.SetMembersAsync(key);
    return results.Select(rv => (string) rv).Without(EmptyCacheSentinel).ToArray();
}

// There are a bunch of these public methods:
public async Task<IEnumerable<string>> UseCache1(bool forceFetch = false) {
    return await Cached("the_key_i_want", forceFetch, () => {
        using (var cnn = MakeConnectionToDatabase("server", "databaseName")) {
            // Uses Dapper:
            return cnn.Query<string>("--expensive sql query").ToArray();
        }
    });
}

I also have a class that makes uses of this in a method requiring some of the information from the cache:

public OtherClass(SchemaCache cache) {
    Cache = cache;
}

private SchemaCache Cache { get; set; }

public Result GetResult(Parameter parameter) {
    return Cache.UseCache1().Result
        .Where(r => Cache.UseCache2(r).Result.Contains(parameter))
        .Select(r => CheckResult(r))
        .FirstOrDefault(x => x != null);
}

All of the above works fine in LinqPad where there's only one instance of everything in question. Instead it fails with a TimeoutException (and later an exception about no available connection). The only difference is that I get an instance of the cache via dependency injection, and I'm pretty sure Nancy uses Tasks to parallelize the requests.

1
Still trying to summarize.Chris Pfohl
Are you by any chance using a transaction or batch? There's a really simple way to deadlock yourself there (that is just as easily fixed)Marc Gravell♦
@MarcGravell Nope...still working up the rest of the example...there's a lot of moving pieces...Chris Pfohl
@ChristopherPfohl that basically says the muxer isn't active in any way; everything has completed and handed back to either the caller (for sync) or the TPL (for async). I wonder if this is simply: not using TPL quite correctly. I would have to see the calling code to comment.Marc Gravell♦
btw; if you're already in async land; dapper has an async API too ;pMarc Gravell♦

1 Answers

14
votes

Here we go:

return Cache.UseCache1().Result

boom; deadlock. But nothing to do with StackExchange.Redis.

At least, from most sync-context providers. This is because your awaits are all implicitly requesting sync-context activation - which can mean "on the UI thread" (winforms, WPF), or "on the currently designated worker thread" (WCF, ASP.NET, MVC, etc). The problem here is that this thread will never be available to process those items, because .Result is a synchronous and blocking call. Thus none of your completion callbacks will get processed, because the only thread that can possibly process them is waiting for them to finish before making itself available.

Note: StackExchange.Redis does not use the sync-context; it explicitly disconnects from sync-context to avoid being the cause of deadlocks (this is normal and recommended for libraries). The key point is that your code doesn't.

Options:

  • don't mix async and .Result / .Wait(), or
  • have all of your await calls (or at least those underneath .UseCache1()) explicitly call .ConfigureAwait(false) - note, however, that this means that the completions will not occur in the calling context!

The first option is the simplest; if you can isolate a tree of calls that do not depend on sync-context then the second approach is viable.

This is a very common problem; I did pretty much the same thing.