Deadlocks causing 'Server failed to resume the transaction' with NHibernate and distributed transactions

Question

We are having an issue when using NHibernate with distributed transactions.

Consider the following snippet:

//
// There is already an ambient distributed transaction
//
using(var scope = new TransactionScope()) {
    using(var session = _sessionFactory.OpenSession())
    using(session.BeginTransaction()) {
        using(var cmd = new SqlCommand(_simpleUpdateQuery, (SqlConnection)session.Connection)) {
            cmd.ExecuteNonQuery();
        }

        session.Save(new SomeEntity());
        session.Transaction.Commit();
    }
    scope.Complete();
}

Sometimes, when the server is under extreme load, we'll see the following:

1. The query executed with cmd.ExecuteNonQuery is chosen as a deadlock victim (we can see it in SQL Profiler), but no exception is raised.
2. session.Save fails with the error message, "The operation is not valid for the state of the transaction."
3. Every time this code is executed after that, session.BeginTransaction fails. The first few times, the inner exception varies (sometimes it is the deadlock exception that should have been raised in step 1). Eventually it stabilizes to "The server failed to resume the transaction. Desc:3800000177." or "New request is not allowed to start because it should come with valid transaction descriptor."

If left alone, the application will eventually (after seconds or minutes) recover from this condition.

Why is the deadlock exception not being reported in step 1? And if we can't resolve that, then how can we prevent our application from temporarily becoming unusable?

The issue has been reproduced in the following environments:

Windows 7 x64 and Windows Server 2003 x86
SQL Server 2005 and 2008
.NET 4.0 and 3.5
NHibernate 3.2, 3.1 and 2.1.2

I've created a test fixture which will sometimes reproduce the issue for us. It is available here: http://wikiupload.com/EWJIGAECG9SQDMZ

I just addressed a problem very similar to this. What is the lifestyle of the session? — CrazyDart
The SessionFactory is registered as a singleton and created with a factory method. The container does not provide the ISession; it is provided by SessionFactory.GetCurrentSession(). For this we're using WcfOperationSessionContext stolen from the NH3.0 source. — jon without an h
Hmmm, well that is just a bit different from what we are doing. Might I suggest you wrap that session with a using, and perhaps the transaction also? Maybe those dispose methods are not cleaning up correctly when the transaction isn't fully committed? Because the method has a transaction, NHibernate should use the same transaction, right? So a dispose on the Transaction might not actually dispose. Just a thought. — CrazyDart
Please see my latest edits - we've managed to simplify the problem scenario dramatically. — jon without an h

jon without an h · Accepted Answer · 2012-01-17T20:47:26

We've finally narrowed this down to a cause.

When opening a session, if there is an ambient distributed transaction, NHibernate attaches a handler to the Transaction.TransactionCompleted event, and that handler closes the session when the distributed transaction completes. This appears to be subject to a race condition: the connection may be closed and returned to the pool before the deadlock error has propagated to the client, which leaves the pooled connection in an unusable state.

The following code will reproduce the error for us occasionally, even without any load on the server. If there is extreme load on the server, it becomes more consistent.

using(var scope = new TransactionScope()) {
    //
    // Force promotion to distributed transaction
    //
    TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

    var connection = new SqlConnection(_connectionString);
    connection.Open();

    //
    // Close the connection once the distributed transaction is
    // completed.
    //
    Transaction.Current.TransactionCompleted += 
        (sender, e) => connection.Close();

    using(connection.BeginTransaction()) {
        //
        // Deadlocks but sometimes does not raise exception
        //
        ForceDeadlockOnConnection(connection);
    }

    scope.Complete();
}

//
// Subsequent attempts to open a connection with the same
// connection string will fail
//

We have not settled on a solution, but the following things will eliminate the problem (while possibly having other consequences):

Turning off connection pooling
Using NHibernate's AdoNetTransactionFactory instead of AdoNetWithDistributedTransactionFactory
Adding error handling that calls SqlConnection.ClearPool() when the "server failed to resume the transaction" error occurs
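
As a rough illustration of the third option, here is a minimal sketch of what such error handling might look like. The class PoisonedPoolGuard, its method names, and the idea of matching on message text are assumptions made for this example and are not taken from our code base or from NHibernate. The only SqlClient call it relies on is the static SqlConnection.ClearPool(SqlConnection) method.

```csharp
using System;
using System.Data.SqlClient;

// Hypothetical helper (name and shape are assumptions for illustration only).
public static class PoisonedPoolGuard
{
    // Fragments of the two error messages quoted in the question.
    // Matching on message text is an assumption; it will not work on
    // servers that return localized messages.
    private static readonly string[] Markers =
    {
        "failed to resume the transaction",
        "should come with valid transaction descriptor"
    };

    // Walks the exception chain, because the interesting error usually
    // arrives as an inner exception of whatever BeginTransaction throws.
    public static bool LooksLikePoisonedConnection(Exception ex)
    {
        for (var current = ex; current != null; current = current.InnerException)
        {
            foreach (var marker in Markers)
            {
                if (current.Message.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
                    return true;
            }
        }
        return false;
    }

    // Runs a unit of work; if it fails with one of the errors above, the pool
    // for this connection string is cleared before the exception is rethrown.
    public static void Run(string connectionString, Action unitOfWork)
    {
        try
        {
            unitOfWork();
        }
        catch (Exception ex)
        {
            if (LooksLikePoisonedConnection(ex))
            {
                // ClearPool identifies the pool from the connection's
                // connection string, so the connection is never opened here.
                using (var connection = new SqlConnection(connectionString))
                {
                    SqlConnection.ClearPool(connection);
                }
            }
            throw; // rethrow without resetting the stack trace
        }
    }
}
```

A caller would wrap the failing operation, for example PoisonedPoolGuard.Run(_connectionString, () => DoWork()), where DoWork stands in for the code that opens the session. The current call still fails, but later calls should get fresh connections instead of waiting for the pool to recover on its own.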

According to Microsoft (https://connect.microsoft.com/VisualStudio/feedback/details/722659/), the SqlConnection class is not thread-safe, and that includes closing the connection on a separate thread. Based on this response we have filed a bug report for NHibernate (http://nhibernate.jira.com/browse/NH-3023).
