
In what cases can a queue manager lose its connectivity to the repository in a cluster environment? I have an environment where a queue manager often loses its connectivity to the repository, and I need to refresh the cluster to fix this and re-establish communication with the other queue managers in the cluster.

Our cluster has 100 queue managers and we have 2 repositories in it.

What exactly do you mean by "a queue manager is losing its connectivity to repository often"? Does the channel go into retry? Does the repository no longer show up in DIS CLUSQMGR? Does the cluster member no longer show up in DIS CLUSQMGR at the repository? What version of WMQ are you running, and what do the error logs show when this happens? - T.Rob
Hello Rob, I didn't do these checks during the issue. I just tried to put a test message to a remote queue and it failed with MQRC 2087 even though the queue exists on the remote server. After a cluster refresh it works fine. I will do these checks when I face it again. We have our repository on MQ V7 and all other servers on MQ V6. - Vignesh
The channels are not in retrying state. - Vignesh

1 Answer


There are a few different issues that can cause this. One is explicitly defined CLUSSDR channels that point to a non-repository QMgr. This causes repository messages to arrive at the non-repository QMgr, which can cause its amqrrmfa repository process to die. Another is that there have been a few APARs (such as this one) that can lead to that process dying. The solutions, respectively, are to fix the configuration issues or to apply the latest Fix Pack. A third issue, less commonly seen, is that a message to a new QMgr will error out before the new QMgr can resolve to the local QMgr. In this case the REFRESH doesn't actually cause the remote QMgr to resolve; it just provides time for the resolution to complete.
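The first cause can be checked from MQSC output. Here is a minimal Python sketch, assuming the text comes from something like `DIS CLUSQMGR(*) QMTYPE DEFTYPE` run through runmqsc on the affected queue manager. The helper names (`parse_records`, `misdirected_senders`) and the sample text are illustrative, not taken from a real system, and the exact layout of runmqsc output may differ between versions.

```python
import re

# Matches ATTRIBUTE(value) pairs as they appear in runmqsc display output.
ATTR = re.compile(r"([A-Z]+)\(([^)]*)\)")

def parse_records(mqsc_output):
    """Split runmqsc display output into one attribute dict per displayed object."""
    records = []
    current = None
    for line in mqsc_output.splitlines():
        if line.lstrip().startswith("AMQ"):
            # Assumption: each displayed object is introduced by an AMQnnnn message line.
            current = {}
            records.append(current)
            continue
        if current is not None:
            for name, value in ATTR.findall(line):
                current[name] = value.strip()
    # Drop message lines that were not followed by any attributes.
    return [r for r in records if r]

def misdirected_senders(records):
    """Return non-repository QMgrs that are reached by a manually defined CLUSSDR."""
    # DEFTYPE(CLUSSDR) is an explicit definition; CLUSSDRB is explicit plus auto-defined.
    manual = {"CLUSSDR", "CLUSSDRB"}
    return sorted(
        r["CLUSQMGR"]
        for r in records
        if "CLUSQMGR" in r
        and r.get("DEFTYPE") in manual
        and r.get("QMTYPE") == "NORMAL"
    )

# Illustrative sample only; queue manager and channel names are made up.
SAMPLE = """\
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(REPOS1)                        CHANNEL(CLUS1.REPOS1)
   CLUSTER(CLUS1)                          DEFTYPE(CLUSSDRB)
   QMTYPE(REPOS)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM42)                          CHANNEL(CLUS1.QM42)
   CLUSTER(CLUS1)                          DEFTYPE(CLUSSDR)
   QMTYPE(NORMAL)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM7)                           CHANNEL(CLUS1.QM7)
   CLUSTER(CLUS1)                          DEFTYPE(CLUSSDRA)
   QMTYPE(NORMAL)
"""

print(misdirected_senders(parse_records(SAMPLE)))
```

With this sample, only QM42 is reported: it is a non-repository QMgr (QMTYPE(NORMAL)) with a manual sender definition. The auto-defined sender to QM7 (CLUSSDRA) is expected and is not flagged.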

Debugging this involves isolating the possible causes. Check that amqrrmfa is running. Check that all non-repository QMgrs have one and ONLY one explicitly defined CLUSSDR channel. Verify that all repositories have one and ONLY one explicitly defined CLUSSDR to each other repository. If overlapping clusters are used, make sure NOT to overlap the channels. This means avoiding channel names like TO.QMGR and preferring names like CLUSTER.QMGR. Verify this by ensuring that channels do not use the CLUSNL attribute and use the CLUSTER attribute instead. Finally, reconcile the objects in both repositories and the non-repository by issuing DIS CLUSQMGR(*) and DIS QCLUSTER(*). The repositories should have identical object inventories. If they don't, then there's the problem. The non-repository should have an entry for every QMgr it has previously talked to.
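The reconciliation step can be done by eye, but with 100 queue managers a small script helps. Here is a rough Python sketch, assuming the output of DIS CLUSQMGR(*) from each repository has been captured as text. The function names and sample output are hypothetical, and the parsing assumes the usual ATTRIBUTE(value) layout of runmqsc output.

```python
import re

def object_names(mqsc_output, attribute):
    """Collect the values of one attribute (e.g. CLUSQMGR or QUEUE) from runmqsc output."""
    pattern = re.compile(r"\b" + re.escape(attribute) + r"\(([^)]*)\)")
    names = {m.group(1).strip() for m in pattern.finditer(mqsc_output)}
    # runmqsc echoes the command itself, e.g. "DIS CLUSQMGR(*)"; skip wildcard values.
    return {n for n in names if "*" not in n}

def compare_inventories(first, second):
    """Return (only_in_first, only_in_second) as sorted lists."""
    return sorted(first - second), sorted(second - first)

# Illustrative captures from the two repositories; all names are made up.
REPOS1_OUT = """\
     1 : DIS CLUSQMGR(*)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM1)                           CHANNEL(CLUS1.QM1)
   CLUSTER(CLUS1)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM2)                           CHANNEL(CLUS1.QM2)
   CLUSTER(CLUS1)
"""

REPOS2_OUT = """\
     1 : DIS CLUSQMGR(*)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM1)                           CHANNEL(CLUS1.QM1)
   CLUSTER(CLUS1)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(QM3)                           CHANNEL(CLUS1.QM3)
   CLUSTER(CLUS1)
"""

only_1, only_2 = compare_inventories(
    object_names(REPOS1_OUT, "CLUSQMGR"),
    object_names(REPOS2_OUT, "CLUSQMGR"),
)
print("only in repository 1:", only_1)
print("only in repository 2:", only_2)
```

Any name that appears in only one list is a discrepancy worth investigating. The same comparison should work for DIS QCLUSTER(*) output by passing "QUEUE" as the attribute, assuming the queue names are displayed as QUEUE(name).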

One thing I have seen in the past was an administrator who had scheduled a REFRESH CLUSTER. His thinking was that this was something they needed to do to fix the cluster, so why not run it on a regular basis? So he scheduled it to run daily. Each night it made the QMgr forget about the other QMgrs in the cluster, and the first time an app resolved a remote QMgr each day there was a flurry of repository traffic. This caused enough of a delay that there were a few 2087 errors each morning. Not that you would do such a thing. :-)