2
votes

We have bought a lot of licenses, done a lot of testing with a lot of promissing results and are on the brink of our first release :).

But now we have hit a big bump on the road, meaning we might have to abandon the bus, if we cant explain and fix it :/.

Our Distributor suddenly have control error messages like the one below:

<?xml version="1.0"?>
<ArrayOfHeaderInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <HeaderInfo>
        <Key>NServiceBus.ControlMessage</Key>
        <Value>True</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.Distributor.WorkerCapacityAvailable</Key>
        <Value>20</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.Distributor.WorkerStarting</Key>
        <Value>True</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>CorrId</Key>
        <Value>58dd98f5-9ac0-44fb-8604-3a0f06787a35\295075</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.Reason</Key>
        <Value>ProcessingFailed</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.ExceptionType</Key>
        <Value>System.InvalidOperationException</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.HelpLink</Key>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.Message</Key>
        <Value>Property ResponseQueue was not retrieved when receiving the message. Ensure that the PropertyFilter is set correctly.</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.Source</Key>
        <Value>NServiceBus.Core</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.ExceptionInfo.StackTrace</Key>
        <Value>   at NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.ProcessMessage(TransportMessage m) in c:\BuildAgent\work\nsb.master_6\src\impl\unicast\transport\NServiceBus.Unicast.Transport.Transactional\TransactionalTransport.cs:line 312
   at NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.ReceiveMessage() in c:\BuildAgent\work\nsb.master_6\src\impl\unicast\transport\NServiceBus.Unicast.Transport.Transactional\TransactionalTransport.cs:line 275
   at NServiceBus.Utils.TransactionWrapper.RunInTransaction(Action callback, IsolationLevel isolationLevel, TimeSpan transactionTimeout) in c:\BuildAgent\work\nsb.master_6\src\utils\TransactionWrapper.cs:line 32
   at NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.Process() in c:\BuildAgent\work\nsb.master_6\src\impl\unicast\transport\NServiceBus.Unicast.Transport.Transactional\TransactionalTransport.cs:line 220</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.OriginalId</Key>
        <Value>58dd98f5-9ac0-44fb-8604-3a0f06787a35\295075</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.FailedQ</Key>
        <Value>someservice.processId.distributor.control@testservices01</Value>
    </HeaderInfo>
    <HeaderInfo>
        <Key>NServiceBus.TimeOfFailure</Key>
        <Value>2013-04-30 10:07:40:750707 Z</Value>
    </HeaderInfo>
</ArrayOfHeaderInfo>

Google tells us that this might relate to some threading issues and maybe even the way that NSB is implemented using peek/receive.

The above exception relates to this file at GitHub: https://github.com/NServiceBus/NServiceBus/blob/master/src/impl/unicast/transport/NServiceBus.Unicast.Transport.Transactional/TransactionalTransport.cs

Details about our implementation:

We use a custom IManageUnitsOfWork, due to some legacy issue which means no DTC against the DB yet. I do not think that this could be the cause, but i think its worth mentioning. This is the implemenation:

public class ManagedUnitOfWorkWithDtcSuppression : IManageUnitsOfWork
{
    private readonly IContainer _container;
    private IUnitOfWork _unitOfWork;
    private readonly TransactionScope _scope;

    public ManagedUnitOfWorkWithDtcSuppression()
    {
        _scope = new TransactionScope(TransactionScopeOption.Suppress);
        _container = ObjectFactory.GetInstance<IContainer>();
    }

    public void Begin()
    {
        _unitOfWork = _container.GetInstance<IUnitOfWork>();
    }

    public void End(Exception exception = null)
    {
        if (exception == null)
        {
            _unitOfWork.Commit();
        }

        _unitOfWork.Dispose();
        _scope.Complete();
        _scope.Dispose();
    }
}

Also we have a special setup where we run 4 identical AppDomains inside 1 service running, which means that when we start the service as a distributor, there are actually 4 distributors running. But these are pr. definition completely isolated from one another. The IBus is unique for each AppDomain, this has been tested.

Our Distributor configuration looks like this:

        return NServiceBus.Configure.With()
            .DefineEndpointName(queuePrefix)
            .Log4Net(ObjectFactory.GetInstance<IServiceBusLog>().Build())
            .StructureMapBuilder()
            .JsonSerializer()
            .AsMasterNode()
            .RunDistributorWithNoWorkerOnItsEndpoint()
            .MsmqTransport()
            .IsTransactional(true)
            .DisableTimeoutManager()
            .DisableSecondLevelRetries()
            .UnicastBus()
            .CreateBus()
            .Start(() => NServiceBus.Configure.Instance.ForInstallationOn<NServiceBus.Installation.Environments.Windows>().Install());

Question:

What is going on here?

Are we screwing with NSB because we are using DTC supression, is there a MSMQ bug or is there a NSB bug?

1
Which version of NServiceBus is this on?Udi Dahan
Version 3.3.5 of the bus.Christian Mikkelsen
Units of work doesn't come into play for the control messages going to the distributor. Something seems to be wrong with either the control queue or the control message coming from the worker. Can you start one of the workers and use Queue Explorer to dump the control message to file so that we can take a look?Andreas Öhlund
Of course :). I still have the messages from the last time this happened and will send them as I do not know how to recreate the incident yet. Where do you want them, at the official NSB support email, dropbox or?Christian Mikkelsen
One thing that seems odd is the fact that the 3 control error msg's I have all have identical first parts of the CorrId (58dd98f5-9ac0-44fb-8604-3a0f06787a35/XXXXXX), allthough they are running in seperate AppDomains, within the same service. We are creating/starting the bus using StructureMap and each bus have a unique Hashcode in their app domain. Could this be the source of the issue? Is there a seperate "singleton" distributor process behind the curtains which might cause issues with our setup (wild guess)?Christian Mikkelsen

1 Answers

0
votes

Let me jump in with an answer a year late! :) I'm pretty sure you're seeing https://github.com/Particular/NServiceBus/pull/2250. Basically Microsoft made changes to the implementation of MessageQueue between .NET 3.5 and .NET 4, making the NSB code non-thread safe. This was fixed in https://github.com/Particular/NServiceBus/releases/tag/3.3.10