Windows Server 2003R2/2008R2/2012, Openfire 3.8.1, Hazelcast 1.0.4, MySQL 5.5.30-ndb-7.2.12-cluster-gpl-log

We've set up 5 servers in an Openfire cluster. Each one is in a different subnet; the subnets are located in different cities and are interconnected through VPN routers (2-8 Mbps):

192.168.0.1 - node0
192.168.1.1 - node1
192.168.2.1 - node2
192.168.3.1 - node3
192.168.4.1 - node4

Openfire is configured to use a MySQL database, which is successfully replicating from the master node0 to all slave nodes (each node uses its own local database server, functioning as a slave).

In Openfire Web Admin > Server Manager > Clustering we are able to see all cluster nodes.

Openfire custom settings for Hazelcast:

hazelcast.max.execution.seconds - 30
hazelcast.startup.delay.seconds - 3
hazelcast.startup.retry.count - 3
hazelcast.startup.retry.seconds - 10

Hazelcast config for node0 (similar on other nodes except for interface section) (%PROGRAMFILES%\Openfire\plugins\hazelcast\classes\hazelcast-cache-config.xml):

<join>
  <multicast enabled="false" /> 
  <tcp-ip enabled="true">
    <hostname>192.168.0.1:5701</hostname> 
    <hostname>192.168.1.1:5701</hostname> 
    <hostname>192.168.2.1:5701</hostname> 
    <hostname>192.168.3.1:5701</hostname> 
    <hostname>192.168.4.1:5701</hostname>
  </tcp-ip>
  <aws enabled="false" /> 
</join>
<interfaces enabled="true">
  <interface>192.168.0.1</interface> 
</interfaces>

These are the only settings changed from default ones.
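For illustration, the per-node difference would presumably look like this on node1. This is only a sketch that assumes the `<join>` section is copied unchanged and only the interface address differs, as described above:

```xml
<!-- node1 (192.168.1.1): assumed counterpart of the node0 fragment above.
     The <join> section with all five members stays identical on every node. -->
<interfaces enabled="true">
  <interface>192.168.1.1</interface>
</interfaces>
```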

The problem is that XMPP clients take too long to authorize, about 3-4 minutes. After authorization, other users in the roster appear inactive for 5-7 minutes, and during this time the logged-in user is marked as Offline in Openfire Web Admin > Sessions. Even once the user can see other logged-in users as active, messages are either not delivered at all, delivered after 5-10 minutes, or delivered only after a few Openfire restarts...

We'd appreciate any help. We've spent about 5 days trying to set up this monster and are out of ideas... :(

Thanks a lot in advance!

UPD 1: Installed Openfire 3.8.2 alpha with Hazelcast 2.5.1 Build 20130427. Same problem.

UPD 2: Tried starting the cluster on two servers that are in the same city, separated by probably 1-2 hops @ 1-5 ms ping. Everything works perfectly! Then we stopped one of those servers and started one in another city (3-4 hops @ 80-100 ms ping), and the problem occurred again: slow authorizations, logged-in users shown as offline in the roster, messages not delivered on time, etc.

UPD 3: Installed Openfire 3.8.2 without the bundled JRE, plus Java SDK 1.7.0_25.

Here are JMX screenshots:

node 0: node0

node 1: node1

The red line is the first client connection (after an Openfire restart). Tested with two users, same result. The first user (node0) connected instantly; the second user (node1) spent 5 seconds connecting. Rosters showed offline users on both sides for 20-30 seconds, then online users started appearing in them. The first user sends a message to the second user. The second user waits 20 seconds, then receives the first message. The reply and all other messages are transferred instantly.

UPD 4:

While digging through the JConsole "Threads" tab, we discovered various thread states:

For example hz.openfire.cached.thread-3:

WAITING on java.util.concurrent.SynchronousQueue$TransferStack@8a5325
Total blocked: 0  Total waited: 449

Maybe this could help... We don't actually know what to look for.

Thanks!

1 Answer

[UPDATE] Note that, per the Hazelcast documentation, WAN replication is supported in their enterprise version only, not in the community version that is shipped with Openfire. You must obtain an enterprise license key from Hazelcast if you would like to use this feature.

You may opt to set up multiple LAN-based Openfire clusters and then federate them using the S2S integration across separate XMPP domains. This is the preferred approach for scaling up Openfire for a very large user base.

[Original post follows]

My guess is that the longer network latency in your remote cluster configuration might be tying up the Hazelcast executor threads (for queries and events). Some of these events and queries are invoked synchronously within an Openfire cluster. Try tuning the following properties:

hazelcast.executor.query.thread.count (default: 8)
hazelcast.executor.event.thread.count (default: 16)

I would start by setting these values to 40/80 (5x) respectively to see if there is any improvement in the overall application responsiveness, and potentially even higher based on your expected load. Additional Hazelcast settings (including other thread pools) plus instructions for adding these properties into the configuration XML can be found here:

Hazelcast configuration properties
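As a rough sketch of what that could look like: assuming the Hazelcast 2.x XML schema, where system properties go in a `<properties>` element directly under the root `<hazelcast>` element, the two settings above might be added to hazelcast-cache-config.xml as follows. The 40/80 values are just the starting point suggested above, not tested recommendations, so check the element placement against the documentation for your Hazelcast version.

```xml
<hazelcast>
  <!-- Assumed placement: a <properties> block as a direct child of <hazelcast>. -->
  <properties>
    <!-- Query executor pool: default 8, raised 5x as a first experiment. -->
    <property name="hazelcast.executor.query.thread.count">40</property>
    <!-- Event executor pool: default 16, raised 5x as a first experiment. -->
    <property name="hazelcast.executor.event.thread.count">80</property>
  </properties>

  <!-- ... the rest of the existing configuration (network, join, interfaces, caches) stays unchanged ... -->
</hazelcast>
```

Apply the same change on every node and restart Openfire so the Hazelcast plugin re-reads the file.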

Hope that helps ... and good luck!