7
votes

We have clustered MSMQ for a set of NServiceBus services, and everything runs great until it doesn't. Outgoing queues on one server start filling up, and pretty soon the whole system is hung.

More details:

We have a clustered MSMQ instance spanning servers N1 and N2. The only other clustered resources are services that operate directly on the clustered queues as local queues, i.e. NServiceBus distributors.

All of the worker processes live on separate servers, Services3 and Services4.

For those unfamiliar with NServiceBus, work goes into a clustered work queue managed by the distributor. Worker apps on Services3 and Services4 send "I'm Ready for Work" messages to a clustered control queue managed by the same distributor, and the distributor responds by sending a unit of work to the worker process's input queue.
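The ready-message handshake above can be sketched as a toy queue simulation. This is plain Python, not NServiceBus code; the queue and worker names are illustrative only:

```python
from collections import deque

# Illustrative sketch of the distributor pattern described above:
# workers announce capacity on a control queue, and the distributor
# forwards one unit of work per "ready" message it receives.
work_queue = deque(["job-1", "job-2", "job-3"])               # clustered work queue
control_queue = deque()                                        # clustered control queue
worker_inputs = {"Services3": deque(), "Services4": deque()}   # workers' local input queues

def send_ready(worker):
    """A worker signals 'I'm Ready for Work' on the control queue."""
    control_queue.append(worker)

def distribute():
    """The distributor pairs each ready message with a unit of work."""
    while control_queue and work_queue:
        worker = control_queue.popleft()
        worker_inputs[worker].append(work_queue.popleft())

send_ready("Services3")
send_ready("Services4")
send_ready("Services3")
distribute()
print(worker_inputs["Services3"])  # deque(['job-1', 'job-3'])
print(worker_inputs["Services4"])  # deque(['job-2'])
```

The key point for the question below: if either the control queue or an outgoing queue stops flowing, no ready messages reach the distributor, so no work is dispatched and the whole system stalls.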

At some point, this process can get completely hung. Here is a picture of the outgoing queues on the clustered MSMQ instance when the system is hung:

Clustered MSMQ Outgoing Queues in Hung State

If I fail over the cluster to the other node, it's like the whole system gets a kick in the pants. Here is a picture of the same clustered MSMQ instance shortly after a failover:

Clustered MSMQ Outgoing Queues After Failover

Can anyone explain this behavior, and what I can do to avoid it, to keep the system running smoothly?

3
Does the secondary node eventually hang? How are the workers acting? Are they actively processing messages? – Adam Fyles
It doesn't happen often enough that I can authoritatively say it happens on only one node or both. The workers are behaving - they are actively processing messages when there are messages in their local input queues to process. – David Boike
Weird. How often does it happen? How many NICs does each node have? I'm wondering if MSMQ is getting confused as to which card to use and therefore is occasionally not completing the ACKs back. There should be a registry setting to lock it in. – Adam Fyles
It happens maybe 2-3 times per week. All servers involved (cluster nodes and worker nodes) are virtualized on vSphere. The cluster nodes are each on vSphere guests on separate hosts. In their virtual configurations, each server has only one NIC. Of course with the clustered services, there are multiple IP addresses bouncing around. – David Boike
Did you ever figure this out? It's almost as if something is taking the node away from the Distributor. – Adam Fyles

3 Answers

2
votes

Maybe your servers were cloned and thus share the same Queue Manager ID (QMId).

MSMQ uses the QMId as a key when caching the addresses of remote machines. If more than one machine on your network has the same QMId, you can end up with stuck or missing messages.

Check out the explanation and solution in this blog post: http://blogs.msdn.com/b/johnbreakwell/archive/2007/02/06/msmq-prefers-to-be-unique.aspx
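To check for this, you can compare the QMId on each node. A hedged sketch: the registry path below is the standard MSMQ machine-cache location on recent Windows versions, but verify it for your OS before relying on it:

```
reg query "HKLM\SOFTWARE\Microsoft\MSMQ\Parameters\MachineCache" /v QMId
```

If two machines report the same GUID, the linked post describes how to regenerate the QMId on one of them.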

2
votes

Over a year later, it seems that our issue has been resolved. The key takeaways seem to be:

  • Make sure you have a solid DNS system so when MSMQ needs to resolve a host, it can.
  • Only create one clustered instance of MSMQ on a Windows Failover Cluster.

When we set up our Windows Failover Cluster, we made the assumption that it would be bad to "waste" resources on the inactive node, and so, having two quasi-related NServiceBus clusters at the time, we made a clustered MSMQ instance for Project1, and another clustered MSMQ instance for Project2. Most of the time, we figured, we would run them on separate nodes, and during maintenance windows they would co-locate on the same node. After all, this was the setup we have for our primary and dev instances of SQL Server 2008, and that has been working quite well.

At some point I began to grow dubious about this approach, especially since failing over each MSMQ instance once or twice seemed to always get messages moving again.

I asked Udi Dahan (author of NServiceBus) about this clustered hosting strategy, and he gave me a puzzled expression and asked "Why would you want to do something like that?" In reality, the Distributor is very light-weight, so there's really not much reason to distribute them evenly among the available nodes.

After that, we decided to take everything we had learned and recreate a new Failover Cluster with only one MSMQ instance. We have not seen the issue since. Of course, making sure this problem is solved would be proving a negative, and thus impossible. It hasn't been an issue for at least 6 months, but who knows, I suppose it could fail tomorrow! Let's hope not.

1
votes

How are your endpoints configured to persist their subscriptions?

What if one (or more) of your services encounters an error and is restarted by the Failover Cluster Manager? In that case, the restarted service would never again receive the "I'm Ready for Work" messages from the other services.

When you fail over to the other node, I would guess that all your services send these messages again and, as a result, everything starts working again.

To test this behavior, do the following:

  1. Stop and restart all your services.
  2. Stop only one of the services.
  3. Restart the stopped service.
  4. If your system does not hang, repeat steps 2-3 for each service in turn.

If your system now hangs again, check your configuration. In this scenario, at least one (if not all) of your services is losing its subscriptions between restarts. If you have not done so already, persist the subscriptions in a database.
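The restart sequence above can be scripted with the Windows service controller. The service name below is hypothetical; substitute the actual service names of your NServiceBus hosts:

```
rem Steps 2-3: stop one service, then restart it (hypothetical service name)
sc stop "NServiceBus.Host$MyEndpoint"
sc start "NServiceBus.Host$MyEndpoint"
rem Then check whether messages still flow to that endpoint's input queue.
```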