13
votes

We have a TIBCO EMS solution that uses built-in server failover in a 2-4 server environment. If the TIBCO admins fail-over services from one EMS server to another, connections are supposed to be transfered to the new server automatically at the EMS service level. For our C# applications using the EMS service, this is not happening - our user connections are not being transfered to the new server after failover and we're not sure why.

Our application connection to EMS at startup only so if the TIBCO admins failover after users have started our application, they users need to restart the app in order to reconnect to the new server (our EMS connection uses a server string including all 4 production EMS servers - if the first attempt fails, it moves to the next server in the string and tries again).

I'm looking for an automated approach that will attempt to reconnect to EMS periodically if it detects that the connection is dead but I'm not sure how best to do that.

Any ideas? We are using TIBCO.EMS.dll version 4.4.2 and .Net 2.x (SmartClient app)

Any help would be appreciated.

3
How are you currently implementing fault tolerance? On the server in the 'factories.conf' file? Does your 'url' property contain the comma separated list of URLs specified in the 'tib_ems_dotnet_ref.pdf' on page 134?Anthony Mastrean
Yes, and that, believe it or not is the problem. Connections should be transfered from one server to another when the EMS server is failed over. This should work when you have a delimited list of EMS servers in your connection string but I believe there's a bug in the EMS.lib making it not workScottCher
Is an error thrown after failover by the listeners? Is an error thrown after failover by the producers (when sending a message)? Most likely, the library expects you to reconnect. The providing multiple servers in the connection string just lets it round robin during connect - not later...TheSoftwareJedi
I hesitate to dive into a client programming solution when we're not sure exactly what the server is doing. Can you provide more information on how/when you know failover is not working? (note: client failover notification - tib ems user guide pg 292, tib dotnet ref pg 220).Anthony Mastrean
ConnectionAttempts sets the number of times a client will re-try to connect to the server. ReconnectAttempts sets the number of times a client will try to re-connect after a network disconnect. Neither of those appear to do what they say. Yes, we get exceptions on Publish.ScottCher

3 Answers

6
votes

This post should sum up my current comments and explain my approach in more detail...

The TIBCO 'ConnectionFactory' and 'Connection' types are heavyweight, thread-safe types. TIBCO suggests that you maintain the use of one ConnectionFactory (per server configured factory) and one Connection per factory.

The server also appears to be responsible for in-place 'Connection' failover and re-connection, so let's confirm it's doing its job and then lean on that feature.

Creating a client side solution is going to be slightly more involved than fixing a server or client setup problem. All sessions you have created from a failed connection need to be re-created (not to mention producers, consumers, and destinations). There are no "reconnect" or "refresh" methods on either type. The sessions do not maintain a reference to their parent connection either.

You will have to manage a lookup of connection/session objects and go nuts re-initializing everyone! or implement some sort of session failure event handler that can get the new connection and reconnect them.

So, for now, let's dig in and see if the client is setup to receive failover notification (tib ems users guide pg 292). And make sure the raised exception is caught, contains the failover URL, and is being handled properly.

8
votes

First off, yes, I am answering my own question. Its important to note, however, that without ajmastrean, I would be nowhere. thank you so much!

ONE: ConnectionFactory.SetReconnAttemptCount, SetReconnAttemptDelay, SetReconnAttemptTimeout should be set appropriately. I think the default values re-try too quickly (on the order of 1/2 second between retries). Our EMS servers can take a long time to failover because of network storage, etc - so 5 retries at 1/2s intervals is nowhere near long enough.

TWO: I believe its important to enable the client-server and server-client heartbeats. Wasn't able to verify but without those in place, the client might not get the notification that the server is offline or switching in failover mode. This, of course, is a server side setting for EMS.

THREE: you can watch for failover event by setting Tibems.SetExceptionOnFTSwitch(true); and then wiring up a exception event handler. When in a single-server environment, you will see a "Connection has been terminated" message. However, if you are in a fault-tolerant multi-server environment, you will see this: "Connection has performed fault-tolerant switch to ". You don't strictly need this notification, but it can be useful (especially in testing).

FOUR: Apparently not clear in the EMS documentation, connection reconnect will NOT work in a single-server environment. You need to be in a multi-server, fault tolerant environment. There is a trick, however. You can put the same server in the connection list twice - strange I know, but it works and it enables the built-in reconnect logic to work.

some code:

private void initEMS()
{
    Tibems.SetExceptionOnFTSwitch(true);
    _ConnectionFactory = new TIBCO.EMS.TopicConnectionFactory(<server>);
    _ConnectionFactory.SetReconnAttemptCount(30);       // 30retries
    _ConnectionFactory.SetReconnAttemptDelay(120000);   // 2minutes
    _ConnectionFactory.SetReconnAttemptTimeout(2000);   // 2seconds
_Connection = _ConnectionFactory.CreateTopicConnectionM(<username>, <password>);
    _Connection.ExceptionHandler += new EMSExceptionHandler(_Connection_ExceptionHandler);
}
private void _Connection_ExceptionHandler(object sender, EMSExceptionEventArgs args)
{
    EMSException e = args.Exception;
    // args.Exception = "Connection has been terminated" -- single server failure
    // args.Exception = "Connection has performed fault-tolerant switch to <server url>" -- fault-tolerant multi-server
    MessageBox.Show(e.ToString());
}
1
votes

Client applications may receive notification of a failover by setting the tibco.tibjms.ft.switch.exception system property

Perhaps the library needs that to work?