0
votes

I have an Artemis 2.11.0 HA pair configured using shared storage over NFS (but I don't know the mount options). Here's the master's broker.xml:

<?xml version='1.0'?>
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xi="http://www.w3.org/2001/XInclude"
               xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">

   <core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="urn:activemq:core ">

      <name>0.0.0.0</name>

      <persistence-enabled>true</persistence-enabled>

      <journal-type>NIO</journal-type>

      <paging-directory>\\test\data\paging</paging-directory>

      <bindings-directory>\\test\data\bindings</bindings-directory>

      <journal-directory>\\test\data\journal</journal-directory>

      <large-messages-directory>\\test\data\large-messages</large-messages-directory>

      <journal-datasync>true</journal-datasync>

      <journal-min-files>2</journal-min-files>

      <journal-pool-files>10</journal-pool-files>

      <journal-device-block-size>4096</journal-device-block-size>

      <journal-file-size>10M</journal-file-size>
      
      <journal-buffer-timeout>752000</journal-buffer-timeout>

      <journal-max-io>1</journal-max-io>

      <disk-scan-period>5000</disk-scan-period>

      <max-disk-usage>90</max-disk-usage>

      <!-- should the broker detect dead locks and other issues -->
      <critical-analyzer>false</critical-analyzer>

      <critical-analyzer-timeout>120000</critical-analyzer-timeout>

      <critical-analyzer-check-period>60000</critical-analyzer-check-period>

      <critical-analyzer-policy>HALT</critical-analyzer-policy>

      
      <page-sync-timeout>1028000</page-sync-timeout>

      <global-max-size>4096Mb</global-max-size>
     
      <ha-policy>
         <shared-store>
            <master>
               <failover-on-shutdown>true</failover-on-shutdown>
            </master>
         </shared-store>
      </ha-policy>

      <connectors> 
         <connector name="netty">tcp://ip-address:61617</connector>
      </connectors>

      <broadcast-groups>
         <broadcast-group name="teamsMQ-broadcast-group">
            <local-bind-address>ip-address</local-bind-address>
            <local-bind-port>9877</local-bind-port>
            <group-address>224.0.0.1</group-address>
            <group-port>9876</group-port>
            <broadcast-period>2000</broadcast-period>
            <connector-ref>netty</connector-ref>
        </broadcast-group>
      </broadcast-groups>
 
      <discovery-groups>
         <discovery-group name="teamsMQ-discovery-group">
            <local-bind-address>ip-address</local-bind-address>
            <group-address>224.0.0.1</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
         </discovery-group>
      </discovery-groups>

      <acceptors>
         <acceptor name="netty">tcp://ip-address:61617</acceptor>

         <!-- Acceptor for every supported protocol -->
         <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>

         <!-- AMQP Acceptor.  Listens on default AMQP port for AMQP traffic.-->
         <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>

         <!-- STOMP Acceptor. -->
         <acceptor name="stomp">tcp://0.0.0.0:61613?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP;useEpoll=true</acceptor>

         <!-- HornetQ Compatibility Acceptor.  Enables HornetQ Core and STOMP for legacy HornetQ clients. -->
         <acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>

         <!-- MQTT Acceptor -->
         <acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT;useEpoll=true</acceptor>
      </acceptors>
      <security-settings>
         <security-setting match="#">
            <permission type="createNonDurableQueue" roles="amq"/>
            <permission type="deleteNonDurableQueue" roles="amq"/>
            <permission type="createDurableQueue" roles="amq"/>
            <permission type="deleteDurableQueue" roles="amq"/>
            <permission type="createAddress" roles="amq"/>
            <permission type="deleteAddress" roles="amq"/>
            <permission type="consume" roles="amq"/>
            <permission type="browse" roles="amq"/>
            <permission type="send" roles="amq"/>
            <!-- we need this otherwise ./artemis data imp wouldn't work -->
            <permission type="manage" roles="amq"/>
         </security-setting>
      </security-settings>

      <address-settings>
         <!-- if you define auto-create on certain queues, management has to be auto-create -->
         <address-setting match="activemq.management#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
         </address-setting>
         <!--default for catch all-->
         <address-setting match="#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
         </address-setting>
      </address-settings>

      <addresses>
         <address name="DLQ">
            <anycast>
               <queue name="DLQ" />
            </anycast>
         </address>
         <address name="ExpiryQueue">
            <anycast>
               <queue name="ExpiryQueue" />
            </anycast>
         </address>
      </addresses>
   </core>
</configuration>

And the slave's:

<?xml version='1.0'?>
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xi="http://www.w3.org/2001/XInclude"
               xsi:schemaLocation="urn:activemq /schema/artemis-configuration.xsd">

   <core xmlns="urn:activemq:core" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="urn:activemq:core ">

      <name>0.0.0.0</name>

      <persistence-enabled>true</persistence-enabled>

      <journal-type>NIO</journal-type>

      <paging-directory>\\test\data\paging</paging-directory>

      <bindings-directory>\\test\data\bindings</bindings-directory>

      <journal-directory>\\test\data\journal</journal-directory>

      <large-messages-directory>\\test\data\large-messages</large-messages-directory>

      <journal-datasync>true</journal-datasync>

      <journal-min-files>2</journal-min-files>

      <journal-pool-files>10</journal-pool-files>

      <journal-device-block-size>4096</journal-device-block-size>

      <journal-file-size>10M</journal-file-size>
      
      <journal-buffer-timeout>688000</journal-buffer-timeout>

      <journal-max-io>1</journal-max-io>

      <disk-scan-period>5000</disk-scan-period>

      <max-disk-usage>90</max-disk-usage>

      <!-- should the broker detect dead locks and other issues -->
      <critical-analyzer>false</critical-analyzer>

      <critical-analyzer-timeout>120000</critical-analyzer-timeout>

      <critical-analyzer-check-period>60000</critical-analyzer-check-period>

      <critical-analyzer-policy>HALT</critical-analyzer-policy>
      
      <page-sync-timeout>1028000</page-sync-timeout>

      <global-max-size>4096Mb</global-max-size>
      
      <ha-policy>
         <shared-store>
            <slave>
               <failover-on-shutdown>true</failover-on-shutdown>
               <allow-failback>true</allow-failback>
            </slave>
         </shared-store>
      </ha-policy>

      <connectors> 
         <connector name="netty">tcp://ip-address:61617</connector>
      </connectors>

      <broadcast-groups>
         <broadcast-group name="teamsMQ-broadcast-group">
            <local-bind-address>ip-address</local-bind-address>
            <local-bind-port>9877</local-bind-port>
            <group-address>224.0.0.1</group-address>
            <group-port>9876</group-port>
            <broadcast-period>2000</broadcast-period>
            <connector-ref>netty</connector-ref>
        </broadcast-group>
      </broadcast-groups>
 
      <discovery-groups>
         <discovery-group name="teamsMQ-discovery-group">
            <local-bind-address>ip-address</local-bind-address>
            <group-address>224.0.0.1</group-address>
            <group-port>9876</group-port>
            <refresh-timeout>10000</refresh-timeout>
         </discovery-group>
      </discovery-groups>

      <acceptors>
         <acceptor name="netty">tcp://ip-address:61617</acceptor>

         <!-- Acceptor for every supported protocol -->
         <acceptor name="artemis">tcp://0.0.0.0:61616?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=CORE,AMQP,STOMP,HORNETQ,MQTT,OPENWIRE;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>

         <!-- AMQP Acceptor.  Listens on default AMQP port for AMQP traffic.-->
         <acceptor name="amqp">tcp://0.0.0.0:5672?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=AMQP;useEpoll=true;amqpCredits=1000;amqpLowCredits=300;amqpDuplicateDetection=true</acceptor>

         <!-- STOMP Acceptor. -->
         <acceptor name="stomp">tcp://0.0.0.0:61613?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=STOMP;useEpoll=true</acceptor>

         <!-- HornetQ Compatibility Acceptor.  Enables HornetQ Core and STOMP for legacy HornetQ clients. -->
         <acceptor name="hornetq">tcp://0.0.0.0:5445?anycastPrefix=jms.queue.;multicastPrefix=jms.topic.;protocols=HORNETQ,STOMP;useEpoll=true</acceptor>

         <!-- MQTT Acceptor -->
         <acceptor name="mqtt">tcp://0.0.0.0:1883?tcpSendBufferSize=1048576;tcpReceiveBufferSize=1048576;protocols=MQTT;useEpoll=true</acceptor>

      </acceptors>


      <security-settings>
         <security-setting match="#">
            <permission type="createNonDurableQueue" roles="amq"/>
            <permission type="deleteNonDurableQueue" roles="amq"/>
            <permission type="createDurableQueue" roles="amq"/>
            <permission type="deleteDurableQueue" roles="amq"/>
            <permission type="createAddress" roles="amq"/>
            <permission type="deleteAddress" roles="amq"/>
            <permission type="consume" roles="amq"/>
            <permission type="browse" roles="amq"/>
            <permission type="send" roles="amq"/>
            <!-- we need this otherwise ./artemis data imp wouldn't work -->
            <permission type="manage" roles="amq"/>
         </security-setting>
      </security-settings>

      <address-settings>
         <!-- if you define auto-create on certain queues, management has to be auto-create -->
         <address-setting match="activemq.management#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
         </address-setting>
         <!--default for catch all-->
         <address-setting match="#">
            <dead-letter-address>DLQ</dead-letter-address>
            <expiry-address>ExpiryQueue</expiry-address>
            <redelivery-delay>0</redelivery-delay>
            <!-- with -1 only the global-max-size is in use for limiting -->
            <max-size-bytes>-1</max-size-bytes>
            <message-counter-history-day-limit>10</message-counter-history-day-limit>
            <address-full-policy>PAGE</address-full-policy>
            <auto-create-queues>true</auto-create-queues>
            <auto-create-addresses>true</auto-create-addresses>
            <auto-create-jms-queues>true</auto-create-jms-queues>
            <auto-create-jms-topics>true</auto-create-jms-topics>
         </address-setting>
      </address-settings>

      <addresses>
         <address name="DLQ">
            <anycast>
               <queue name="DLQ" />
            </anycast>
         </address>
         <address name="ExpiryQueue">
            <anycast>
               <queue name="ExpiryQueue" />
            </anycast>
         </address>
      </addresses>
   </core>
</configuration>

The master node goes down for an unknown reason. The below log gets printed continuously:

AMQ222154: Error checking DLQ: ActiveMQShutdownException[errorType=SHUTDOWN_ERROR message=Journal must be in state=LOADED, was [STOPPED]]
2021-01-15 23:02:05,414 WARN  [org.apache.activemq.artemis.core.server] AMQ222154: Error checking DLQ: ActiveMQShutdownException[errorType=SHUTDOWN_ERROR message=Journal must be in state=LOADED, was [STOPPED]] 
    at org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkJournalIsLoaded(JournalImpl.java:1087) [artemis-journal-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendUpdateRecord(JournalImpl.java:886) [artemis-journal-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.journal.Journal.appendUpdateRecord(Journal.java:98) [artemis-journal-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.updateDeliveryCount(AbstractJournalStorageManager.java:756) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.server.impl.QueueImpl.checkRedelivery(QueueImpl.java:3052) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.server.impl.RefsOperation.rollbackRedelivery(RefsOperation.java:166) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.server.impl.RefsOperation.afterRollback(RefsOperation.java:113) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.afterRollback(TransactionImpl.java:589) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.access$200(TransactionImpl.java:40) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl$4.done(TransactionImpl.java:442) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:244) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) [artemis-commons-2.11.0.jar:2.11.0] 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_275] 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_275] 
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.11.0.jar:2.11.0]

The Slave comes up as expected, but throws an NPE:

2021-01-15 23:02:27,529 INFO  [org.apache.activemq.artemis.core.server] AMQ221010: Backup Server is now live
2021-01-15 23:02:27,545 ERROR [org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation: java.lang.NullPointerException 
    at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation$FailbackChecker.<init>(SharedStoreBackupActivation.java:193) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.startFailbackChecker(SharedStoreBackupActivation.java:185) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:118) [artemis-server-2.11.0.jar:2.11.0] 
    at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3863) [artemis-server-2.11.0.jar:2.11.0]

The master attempts to start, but it doesn't progress beyond AMQ221034: Waiting indefinitely to obtain live lock. The logs are stuck at this point even after multiple restarts.

2021-01-15 23:03:56,238 INFO  [org.apache.activemq.artemis.core.server] AMQ221006: Waiting to obtain live lock
2021-01-15 23:03:56,300 INFO  [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
2021-01-15 23:03:56,581 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
2021-01-15 23:03:56,581 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
2021-01-15 23:03:56,581 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
2021-01-15 23:03:56,581 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
2021-01-15 23:03:56,581 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
2021-01-15 23:03:56,581 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
2021-01-15 23:03:56,644 WARN  [org.apache.activemq.artemis.core.server] AMQ222035: Directory \\test\data\paging\cd776bae-1a55-11eb-985d-0050569136c8 did not have an identification file address.txt
2021-01-15 23:03:56,644 WARN  [org.apache.activemq.artemis.core.server] AMQ222035: Directory \\test\data\paging\a84f1e4f-1f1a-11eb-a37f-0050569136c8 did not have an identification file address.txt
2021-01-15 23:03:56,644 WARN  [org.apache.activemq.artemis.core.server] AMQ222035: Directory \\test\data\paging\a87edff5-1f1a-11eb-a37f-0050569136c8 did not have an identification file address.txt
2021-01-15 23:03:56,988 INFO  [org.apache.activemq.artemis.core.server] AMQ221034: Waiting indefinitely to obtain live lock

Can you please advise on the issue here and the steps to recover?

Does the NPE in the slave start-up have any effects on the queue/functioning?

Do I need to stop the Slave manually to get the master to start successfully?

1
Also, I would expect to see additional details about the broker shutdown before the SHUTDOWN_ERROR messages. Can you provide the full log file? - Justin Bertram
Yes, It is NFS Shared store, The issue is in one of our customer's instance and I don't have the details of mount options. - Suman Moorthy

1 Answers

0
votes

If the master broker encounters any "critical IO" problems it will automatically shut itself down. When it shuts itself down it will relinquish the lock it has on the shared journal. When the shared lock is relinquished the slave will automatically activate. When the master is restarted it will attempt to acquire the lock on the shared journal but it won't be able to since the slave has it.

The slave failed to set up the FailbackChecker thread due to a NullPointerException because there is no <cluster-connection> configured in either broker.xml. This is an invalid configuration. You must configure a <cluster-connection>, e.g.:

      <cluster-connections>
         <cluster-connection name="my-cluster">
            <connector-ref>netty</connector-ref>
            <message-load-balancing>ON_DEMAND</message-load-balancing>
            <discovery-group-ref discovery-group-name="teamsMQ-discovery-group"/>
         </cluster-connection>
      </cluster-connections>

Since the FailbackChecker thread isn't running the slave will not know that the master has restarted and initiated a fail-back. Therefore, you will need to stop the slave so it can relinquish its lock on the shared journal. At this point the master broker will start. Keep in mind that any clients connected to the slave will be disconnected and will have to reconnect to the master.