
GlassFish cluster + remote instance fails to recover EJB timers

ameyc

Our setup (GlassFish version 3.1.2.2):

1. The DAS and instance-1 run on the same machine, while instance-2 runs on another machine on the same network as a config node.
2. We have set up transaction logging in a shared directory as per the GlassFish High Availability Guide: http://docs.oracle.com/cd/E18930_01/html/821-2416/gjjpy.html#gaxim
3. We are using a unicast configuration for cluster communication, since a Network Load Balancer is already running in multicast mode on the network.
4. Our application (an .ear containing multiple .war files) has 2 persistent timers, since we need only one instance of each timer running in the cluster at a time (a minimal sketch of such a timer follows this list).
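For context, here is a minimal sketch of what we mean by a persistent timer; the class name, schedule, and info string are made-up examples for illustration, not our actual code:

import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.ejb.Timer;

@Singleton
public class NightlyCleanupTimer {

    // persistent = true stores the timer in the EJB timer service store,
    // which is what should allow a surviving cluster instance to recover
    // and fire it after the owning instance goes down.
    @Schedule(hour = "2", minute = "0", persistent = true, info = "nightly-cleanup")
    public void runNightlyCleanup(Timer timer) {
        // business logic for the scheduled job goes here
    }
}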

When instance-1 (or instance-2) is shut down normally, the other instance recovers the timers from the shut-down instance as expected. When instance-2 crashes or goes offline abnormally, instance-1 recovers its timers (again, as expected). But when instance-1 crashes, instance-2 does not recover its timers.

As far as I can see from the logs, instance-2 receives the failover notification for instance-1 and starts the recovery, but finishes it without recovering any transactions or timers from the failed instance.

Can anyone tell me what the problem could be? (Should I provide any more information?)

ameyc

After 2 weeks or so of work, we have finally found the problem.

It seems that when an instance in a cluster goes down, the recovering instance checks whether the downed instance is still up by trying to connect to that instance's node-host on its admin port. If you are using the default node created on the DAS (as we were), node-host is set to "localhost".

So instance-2 was checking whether instance-1 was down by trying to connect to "localhost" instead of instance-1's IP address, as it should have. Since it could connect to localhost, instance-1 was falsely marked as still running and the recovery did not proceed.
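To illustrate the effect, here is a sketch of the kind of reachability check described above; this is not GlassFish's actual recovery code, and the class name, method name, and timeout are made up:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class NodeReachabilityCheck {

    // Try to open a TCP connection to the downed instance's node-host/admin port.
    // If node-host is "localhost", the connection is made to the local machine,
    // so the check succeeds and the crashed instance is wrongly considered alive.
    static boolean looksAlive(String nodeHost, int adminPort) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(nodeHost, adminPort), 2000);
            return true;   // something answered, so recovery is skipped
        } catch (IOException e) {
            return false;  // unreachable, so its timers and transactions get recovered
        }
    }

    public static void main(String[] args) {
        // With the default DAS-created node this effectively probes localhost,
        // which answers even though instance-1 has crashed.
        System.out.println(looksAlive("localhost", 4848));
    }
}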

To fix this, we had to change the node-host for the instance-1 node directly in the domain's config/domain.xml, since the configuration of the default localhost- node cannot be changed through asadmin or the admin console.