I have a problem with HazelcastClient (Java) when the cluster goes down. Both the client and the cluster run Hazelcast 3.8.1, the latest version.

The following code is executed periodically:

getMap().executeOnEntries(new MyProcessor<>(), Predicates.equal("field", var));

The problem is that when the cluster goes down, Hazelcast only logs warnings and does not throw an exception:

2017-04-28 18:32:19,905 [WARN] from com.hazelcast.client.connection.ClientConnectionManager in hz.client_0.internal-1 - hz.client_0 [aa-api] [3.8.1] Heartbeat failed to connection : ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/xxx.xxx.4.125:49688 remote=/xxx.xxx.8.118:5701]}, remoteEndpoint=[xxx.xxx.8.118]:5701, lastReadTime=2017-04-28 18:31:15.445, lastWriteTime=2017-04-28 18:32:14.905, closedTime=never, lastHeartbeatRequested=2017-04-28 18:32:14.905, lastHeartbeatReceived=2017-04-28 18:31:14.905, connected server version=3.8.1}
2017-04-28 18:32:20,884 [WARN] from com.hazelcast.client.spi.ClientPartitionService in hz.client_0.internal-3 - hz.client_0 [aa-api] [3.8.1] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/xxx.xxx.4.125:49688 remote=/xxx.xxx.8.118:5701]}, remoteEndpoint=[xxx.xxx.8.118]:5701, lastReadTime=2017-04-28 18:31:15.445, lastWriteTime=2017-04-28 18:32:14.905, closedTime=never, lastHeartbeatRequested=2017-04-28 18:32:14.905, lastHeartbeatReceived=2017-04-28 18:31:14.905, connected server version=3.8.1}
at com.hazelcast.client.spi.impl.ClientInvocationFuture.resolve(ClientInvocationFuture.java:73)
at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at com.hazelcast.util.executor.LoggingScheduledExecutor$LoggingDelegatingFuture.run(LoggingScheduledExecutor.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
Caused by: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/xxx.xxx.4.125:49688 remote=/xxx.xxx.8.118:5701]}, remoteEndpoint=[xxx.xxx.8.118]:5701, lastReadTime=2017-04-28 18:31:15.445, lastWriteTime=2017-04-28 18:32:14.905, closedTime=never, lastHeartbeatRequested=2017-04-28 18:32:14.905, lastHeartbeatReceived=2017-04-28 18:31:14.905, connected server version=3.8.1}
at com.hazelcast.client.spi.impl.ClientInvocationServiceSupport$CleanResourcesTask.notifyException(ClientInvocationServiceSupport.java:229)
at com.hazelcast.client.spi.impl.ClientInvocationServiceSupport$CleanResourcesTask.run(ClientInvocationServiceSupport.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
... 6 common frames omitted
Caused by: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/xxx.xxx.4.125:49688 remote=/xxx.xxx.8.118:5701]}, remoteEndpoint=[xxx.xxx.8.118]:5701, lastReadTime=2017-04-28 18:31:15.445, lastWriteTime=2017-04-28 18:32:14.905, closedTime=never, lastHeartbeatRequested=2017-04-28 18:32:14.905, lastHeartbeatReceived=2017-04-28 18:31:14.905, connected server version=3.8.1}
at com.hazelcast.client.spi.impl.ClusterListenerSupport.heartbeatStopped(ClusterListenerSupport.java:259)
at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl$Heartbeat.fireHeartbeatStopped(ClientConnectionManagerImpl.java:503)
at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl$Heartbeat.run(ClientConnectionManagerImpl.java:462)
... 10 common frames omitted
2017-04-28 18:32:22,904 [WARN] from com.hazelcast.client.connection.nio.ClientConnection in hz.client_0.internal-1 - hz.client_0 [aa-api] [3.8.1] ClientConnection{alive=false, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/xxx.xxx.4.125:49688 remote=/xxx.xxx.8.118:5701]}, remoteEndpoint=[xxx.xxx.8.118]:5701, lastReadTime=2017-04-28 18:31:15.445, lastWriteTime=2017-04-28 18:32:14.905, closedTime=2017-04-28 18:32:19.905, lastHeartbeatRequested=2017-04-28 18:32:14.905, lastHeartbeatReceived=2017-04-28 18:31:14.905, connected server version=3.8.1} lost. Reason: com.hazelcast.spi.exception.TargetDisconnectedException[Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, socketChannel=DefaultSocketChannelWrapper{socketChannel=java.nio.channels.SocketChannel[connected local=/xxx.xxx.4.125:49688 remote=/xxx.xxx.8.118:5701]}, remoteEndpoint=[xxx.xxx.8.118]:5701, lastReadTime=2017-04-28 18:31:15.445, lastWriteTime=2017-04-28 18:32:14.905, closedTime=never, lastHeartbeatRequested=2017-04-28 18:32:14.905, lastHeartbeatReceived=2017-04-28 18:31:14.905, connected server version=3.8.1}]

How can I handle this exception so that I can take action?

Thanks,

EDIT: the problem also occurs when the node I was connected to goes down. The client doesn't connect to another node (AWS Discovery).

What is your Hazelcast version (both client / server)? - noctarius
Thanks, any specific hazelcast(-client).xml configurations? - noctarius
Actually, after three days of research, I found that it was a configuration problem. There is also a missing reconnection feature: github.com/hazelcast/hazelcast/issues/9692. In addition, the heartbeat settings were too high, so the client could not detect the dead node in time. After changing the configuration and checking the client's health before each operation, I can now detect a dead node - Flo354
Glad it was solved :) You might want to post an answer yourself to explain to others what was wrong. - noctarius
Thanks, I forgot. You can comment on the question or update it if you want - Flo354

1 Answer

The problem was mainly one of configuration. Some timeouts and health-check intervals were too high.

Below are the default properties for clients:

hazelcast.client.heartbeat.interval = 10000ms

hazelcast.client.heartbeat.timeout = 300000ms

hazelcast.client.invocation.timeout.seconds = 120s

And here are my new values:

hazelcast.client.heartbeat.interval = 2000

hazelcast.client.heartbeat.timeout = 5000

hazelcast.client.invocation.timeout.seconds = 10
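
One way to apply these values is sketched below, under the assumption that they are set as JVM system properties before the client is created. As far as I know, Hazelcast falls back to system properties for `hazelcast.*` keys that are not set on the config itself. The class name `ClientTimeouts` is purely illustrative and not part of Hazelcast.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the property names and values come from the answer.
final class ClientTimeouts {
    private ClientTimeouts() {}

    /** The tuned values from the answer, keyed by property name. */
    static Map<String, String> tunedValues() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hazelcast.client.heartbeat.interval", "2000");        // milliseconds
        props.put("hazelcast.client.heartbeat.timeout", "5000");         // milliseconds
        props.put("hazelcast.client.invocation.timeout.seconds", "10");  // seconds
        return props;
    }

    /** Copies the values into JVM system properties; call before creating the client. */
    static void applyAsSystemProperties() {
        tunedValues().forEach(System::setProperty);
    }
}
```

The same keys can also be set directly on the client configuration with `ClientConfig.setProperty(name, value)`, which keeps the settings local to one client instead of the whole JVM.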

I also completely reworked the way I obtain maps, topics and, more generally, the Hazelcast instance.

At instantiation time

I handle every exception (most of them extend RuntimeException), and I notify each class that uses the instance that it is now available.

try {
    hazelcastInstance = HazelcastClient.newHazelcastClient(config);
    eventListeners.forEach(HazelcastEventListener::onConnect);
} catch (Throwable e) {
    Logger.error(e.getMessage(), e);
    return null;
}

Before each request that uses the instance

I call code that verifies the availability of the instance, and if an error occurs, I notify each class that uses it that the instance is down.

public boolean isClientActive() {
    if (getInstance() == null) {
        return false;
    }

    try {
        getMap("registration").isLocked("a");
    } catch (Throwable e) {
        hazelcastInstance = null;
        eventListeners.forEach(HazelcastEventListener::onDisconnect);
        return false;
    }

    return true;
}
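
The check above can fire `onDisconnect` several times if multiple threads probe a dead instance at once. The following is a minimal sketch of one way to notify only once per outage, not the code I actually run. `ProbedInstance` is a hypothetical class, the type parameter `T` stands in for `HazelcastInstance`, and the probe stands in for a cheap call such as `getMap("registration").isLocked("a")`.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical holder; T stands in for HazelcastInstance. */
final class ProbedInstance<T> {
    private final AtomicReference<T> instance = new AtomicReference<>();
    private final Consumer<T> probe; // cheap call that throws if the cluster is unreachable
    private final List<Runnable> disconnectListeners = new CopyOnWriteArrayList<>();

    ProbedInstance(T initial, Consumer<T> probe) {
        this.instance.set(initial);
        this.probe = probe;
    }

    void addDisconnectListener(Runnable listener) {
        disconnectListeners.add(listener);
    }

    boolean isActive() {
        T current = instance.get();
        if (current == null) {
            return false;
        }
        try {
            probe.accept(current);
            return true;
        } catch (RuntimeException e) {
            // Only the thread that wins the swap notifies, so listeners fire once per outage.
            if (instance.compareAndSet(current, null)) {
                disconnectListeners.forEach(Runnable::run);
            }
            return false;
        }
    }
}
```

With a short `hazelcast.client.invocation.timeout.seconds`, the probe fails quickly instead of blocking for the default timeout, which is what makes checking before each request practical.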

Getting notified when a member leaves

// add a membership listener on the cluster
// to get notified when a member is removed
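hazelcastInstance.getCluster().addMembershipListener(new MembershipListener() {
    @Override
    public void memberAdded(MembershipEvent membershipEvent) {}

    @Override
    public void memberRemoved(MembershipEvent membershipEvent) {
        // restart the client once the last member has left
        if (membershipEvent.getMembers().isEmpty()) {
            restartInstance();
        }
    }

    // required by the 3.x MembershipListener interface
    @Override
    public void memberAttributeChanged(MemberAttributeEvent memberAttributeEvent) {}
});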

Handling of my HazelcastEventListener

Each class that uses Hazelcast registers an event listener:

    hazelcastManager.addEventListener(new HazelcastEventListener() {
        @Override
        public void onConnect() {
            map = hazelcastManager.getMap(mapName);
        }

        @Override
        public void onDisconnect() {
            map = null;
        }
    });
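
For readers who want to reproduce the listener mechanism, here is a rough sketch of the registry side. The `HazelcastEventListener` interface matches the two callbacks used above; `ListenerRegistry` and its `fireConnect`/`fireDisconnect` methods are hypothetical names for what my manager class does internally. One detail worth handling is a listener that registers after the connection is already up.

```java
import java.util.ArrayList;
import java.util.List;

/** Callback pair used in the answer. */
interface HazelcastEventListener {
    void onConnect();
    void onDisconnect();
}

/** Hypothetical stand-in for the listener-handling part of the manager. */
final class ListenerRegistry {
    private final List<HazelcastEventListener> listeners = new ArrayList<>();
    private boolean connected;

    synchronized void addEventListener(HazelcastEventListener listener) {
        listeners.add(listener);
        // A listener registered after the connection is up would otherwise never see onConnect.
        if (connected) {
            listener.onConnect();
        }
    }

    synchronized void fireConnect() {
        connected = true;
        listeners.forEach(HazelcastEventListener::onConnect);
    }

    synchronized void fireDisconnect() {
        connected = false;
        listeners.forEach(HazelcastEventListener::onDisconnect);
    }
}
```

The methods are synchronized for simplicity, so callbacks run while the lock is held; listeners should therefore stay short, as the ones above do.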

Reconnecting the Hazelcast client

Calling getInstance() will try to reconnect when the hazelcastInstance is null.
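
A minimal sketch of that lazy reconnect, assuming a single shared holder, could look like the following. `ReconnectingHolder` is a hypothetical name, `T` stands in for `HazelcastInstance`, and the factory stands in for something like `() -> HazelcastClient.newHazelcastClient(config)`.

```java
import java.util.function.Supplier;

/** Hypothetical holder; T stands in for HazelcastInstance. */
final class ReconnectingHolder<T> {
    private final Supplier<T> factory; // creates a new client; may throw while the cluster is down
    private T instance;

    ReconnectingHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the live instance, attempting one reconnect if there is none; null on failure. */
    synchronized T getInstance() {
        if (instance == null) {
            try {
                instance = factory.get();
            } catch (RuntimeException e) {
                // Cluster still unreachable: leave instance null so the next call retries.
                return null;
            }
        }
        return instance;
    }

    /** Called when a health check fails, so the next getInstance() reconnects. */
    synchronized void invalidate() {
        instance = null;
    }
}
```

Making both methods synchronized prevents two threads from creating two clients at the same time, at the cost of callers waiting while a reconnect attempt is in progress.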

Problems

It avoids many errors, but some work remains to handle concurrency problems. I consider this solution a workaround: it is not very efficient, and it mostly patches over a missing feature in Hazelcast.

That's why I will not "accept" this solution. If someone has a better solution, please let us know.