I can't figure out how to get the messaging subsystem in Wildfly 12 to redeliver a queue message to a different node when the original node fails. I need a way to direct a message to a different node when the first attempted node doesn't acknowledge/commit soon enough.

I have a 3-node Wildfly 12 cluster with a single queue (TestQueue). I deploy an application with a single bean that grabs a JMS connection and creates a session with a consumer on that queue; here's the constructor:

public TestMessageListener( ConnectionFactory connectionFactory, Destination destination )
{
    this.context = connectionFactory.createContext( JMSContext.DUPS_OK_ACKNOWLEDGE );
    this.consumer = this.context.createConsumer( destination );
    this.consumer.setMessageListener( this );
    this.context.start();
}

The connection factory and destination are injected elsewhere:

@Resource( lookup = "java:/ConnectionFactory" ) private ConnectionFactory connectionFactory;
@Resource( lookup = "java:/jms/queue/TestQueue" ) private Destination destination;

Back in the listener I just log what it receives:

@Override
public void onMessage( Message message )
{
    try
    {
        message.acknowledge();
        String body = new String( message.getBody( byte[].class ), StandardCharsets.UTF_8 );
        LOG.info( body );
    }
    catch ( JMSException e )
    {
        LOG.warning( e.toString() );
    }
}

Finally, I have STOMP enabled in the messaging subsystem configuration:

<socket-binding-groups>
    <socket-binding-group name="full-ha-sockets" default-interface="public">
        <socket-binding name="stomp" port="6164"/>
        ... 
    </socket-binding-group>
</socket-binding-groups>

<subsystem xmlns="urn:jboss:domain:messaging-activemq:3.0">
    <server name="default">
        <remote-acceptor name="stomp-acceptor" socket-binding="stomp">
            <param name="protocols" value="STOMP"/>
        </remote-acceptor>            
        <address-setting name="jms.queue.TestQueue" redistribution-delay="0"/>
        ...
    </server>

I connect over STOMP and send a test message every 2 seconds with a unique identifier. Each of the 3 nodes receives one in turn, round-robin. Then I unplug the network cable from one of the nodes.

After 1 minute (which I assume is connection-ttl), I get error messages on the other 2 nodes about connection failures:

2018-07-11 20:02:18,813 INFO  [TestMessageListener] (Thread-1 (ActiveMQ-client-global-threads)) TEST 435
2018-07-11 20:02:21,448 WARN  [org.apache.activemq.artemis.core.client] (Thread-8 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@3070595f)) AMQ212037: Connection failure has been detected: AMQ119014: Did not receive data from /192.168.1.82:51046 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
2018-07-11 20:02:21,449 WARN  [org.apache.activemq.artemis.core.server] (Thread-8 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@3070595f)) AMQ222061: Client connection failed, clearing up resources for session b7be7d58-855c-11e8-91dd-6c626d5557a6
2018-07-11 20:02:21,449 WARN  [org.apache.activemq.artemis.core.server] (Thread-8 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@3070595f)) AMQ222107: Cleared up resources for session b7be7d58-855c-11e8-91dd-6c626d5557a6
2018-07-11 20:02:21,449 WARN  [org.apache.activemq.artemis.core.server] (Thread-8 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@3070595f)) AMQ222061: Client connection failed, clearing up resources for session b7becb79-855c-11e8-91dd-6c626d5557a6
2018-07-11 20:02:21,449 WARN  [org.apache.activemq.artemis.core.server] (Thread-8 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@3070595f)) AMQ222107: Cleared up resources for session b7becb79-855c-11e8-91dd-6c626d5557a6

After an additional 30 seconds, I get another round of error messages about connection failures:

2018-07-11 20:02:49,443 WARN  [org.apache.activemq.artemis.core.client] (Thread-1 (ActiveMQ-client-global-threads)) AMQ212037: Connection failure has been detected: AMQ119011: Did not receive data from server for org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection@5e696d4b[local= /192.168.1.27:39202, remote=/192.168.1.82:8080] [code=CONNECTION_TIMEDOUT]
2018-07-11 20:02:49,444 WARN  [org.apache.activemq.artemis.core.server] (Thread-1 (ActiveMQ-client-global-threads)) AMQ222095: Connection failed with failedOver=false
2018-07-11 20:02:49,446 WARN  [org.apache.activemq.artemis.core.server] (Thread-1 (ActiveMQ-client-global-threads)) AMQ222095: Connection failed with failedOver=false

Note that my STOMP client is connected to one of the good nodes and continues to send messages to the queue while the "failed" box is disconnected.

My problems with this are:

  • During the 90 seconds, Artemis continues to deliver messages to the unplugged box.
  • I can't figure out how to get Artemis to try redelivering old messages to a different node, even after the 90 seconds have elapsed.
  • I don't understand why it continues to try delivering messages to the unplugged box after the first round of connection errors at 60s.
  • Setting redistribution-delay has no effect, though I thought it would be useful due to https://activemq.apache.org/artemis/docs/latest/clusters.html. Like this:

    <address-setting name="jms.queue.TestQueue" redistribution-delay="0"/>

  • When I plug the network cable back in, all messages that should have been delivered to the failed node are now delivered.

For this particular queue I would not only like redelivery attempts to another node, but I'd like a message acknowledgement timeout to trigger this redelivery or, failing that, a small connection-ttl of 750-1000ms. If I set connection-ttl to even 15000ms, all connections between all nodes (even when the whole cluster is healthy) throw errors after 15000ms. According to the documentation at https://activemq.apache.org/artemis/docs/latest/configuration-index.html, this parameter is, "TTL for the Bridge. This should be greater than the ping period." It's unclear what parameter the "ping period" is and it's even more unclear how such a parameter would map to the Wildfly subsystem configuration. I'm assuming connection-ttl is here, where I set it to 15000:

<cluster-connection name="my-cluster" address="jms" connector-name="http-connector" connection-ttl="60000" retry-interval-multiplier="1.5" max-retry-interval="60000" discovery-group="dg-group1"/>

I'm perfectly fine with receiving and dealing with duplicate messages; I thought a combination of JMSContext.DUPS_OK_ACKNOWLEDGE and redistribution-delay="0" would solve at least the redelivery portion of it.

I tried JMSContext.TRANSACTED and used JMSContext.commit() and JMSContext.rollback(). Obviously rollback() doesn't apply when the failed node is cut off from the rest of the cluster, but it's the only way I can see to trigger redelivery.

I'm at the stage right now where I'm just tweaking a seemingly endless number of configuration parameters with little to no effect. Any help would be greatly appreciated.

1 Answer

I believe the cluster-connection is using the default reconnect-attempts of -1, which means it retries forever. As long as the cluster-connection keeps trying to reconnect to the down node, messages destined for that node are kept in the special "store-and-forward" queue. You should be able to set reconnect-attempts to something other than -1 so that the cluster-connection eventually gives up, at which point the messages meant for the down node become generally available for consumption.
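
A minimal sketch of that change, assuming the cluster-connection from the question is otherwise left exactly as posted. The value 5 is purely illustrative and not a tested recommendation; the only addition is the reconnect-attempts attribute:

<!-- reconnect-attempts="5" is an illustrative value (assumption); the default is -1, i.e. retry indefinitely -->
<cluster-connection name="my-cluster" address="jms" connector-name="http-connector" connection-ttl="60000" retry-interval-multiplier="1.5" max-retry-interval="60000" reconnect-attempts="5" discovery-group="dg-group1"/>

How long the cluster-connection takes to give up should then depend on the retry interval settings together with the number of attempts, so the value would need tuning against how quickly you want the held messages released.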