
JMS HA has issues

5 replies
sgjava
Joined: 2005-07-05

Configuration:

glassfish1.corp.local

  • cluster
  • ssh node glassfish1
  • instance1

glassfish2.corp.local

  • ssh node glassfish2
  • instance2

First, I started with a conventional cluster with a master broker. If you lose the master broker (i.e. power it off), the instances on the other servers still accept messages, but the messages never reach the MDB (onMessage); see the comment at http://www.java.net/forum/topic/glassfish/glassfish/orb-not-listening-se....

Second, I tried a conventional cluster of peer brokers, using HA-JDBC and Derby on both GlassFish servers so that the JMS configuration store is highly available as well. When I power off the glassfish2 server, messages are sent (via the ORB or directly) without error, but they are never picked up from the queue. On glassfish1 you can see that instance2 is in the stopped state, so the host going down is detected.

This is the code I use in my unit test to wait until the queue is empty:

    public final void waitForEmptyQueue(final QueueSession queueSession,
            final Queue queue) throws Exception {
        log.debug("waitForEmptyQueue");
        // Poll the queue with a fresh browser until no messages remain
        boolean hasMoreItem = true;
        while (hasMoreItem) {
            final QueueBrowser browser = queueSession.createBrowser(queue);
            try {
                hasMoreItem = browser.getEnumeration().hasMoreElements();
            } finally {
                // Always release the browser, even if browsing fails
                browser.close();
            }
            if (hasMoreItem) {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag before propagating
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                }
            }
        }
    }

If I then power glassfish2 back on, but do not start GlassFish on it, things work as expected. This is a big issue, since you cannot guarantee that a host will come back online in a timely manner. Is there some configuration that will prevent this zombie condition?

I also tried powering off glassfish1 (leaving glassfish2 running), and the hang occurs only when connecting directly:

13:47:49.705 [main] WARN  javax.jms - [C4003]: Error occurred on connection creation [glassfish1.corp.local:27676]. - cause: java.net.ConnectException: Connection refused
13:47:52.936 [main] WARN  javax.jms - [C4003]: Error occurred on connection creation [glassfish1.corp.local:27676]. - cause: java.net.ConnectException: Connection refused
13:47:56.051 [main] DEBUG com.bhn.services.jms.ClusterTest - waitForEmptyQueue

When I power glassfish1 back on, things start working again. It appears that the host must be up, even if GlassFish is not running on it.
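In the test, the zombie condition shows up as waitForEmptyQueue never returning. A minimal sketch, assuming you would rather fail the test than wait indefinitely, runs each emptiness check on a worker thread and gives up at an overall deadline, so even a single check that blocks inside the client runtime cannot hang the caller. The class BoundedWait and its parameters are hypothetical (not part of any GlassFish or MQ API), and the emptiness check is passed in as a BooleanSupplier so the sketch runs without a broker.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical helper: a bounded version of the waitForEmptyQueue polling loop.
final class BoundedWait {
    private BoundedWait() {
    }

    /**
     * Polls isEmpty until it returns true or timeoutMs elapses.
     * Returns true if the queue emptied, false on timeout or interruption.
     */
    static boolean waitUntilEmpty(BooleanSupplier isEmpty, long timeoutMs, long pollMs) {
        // Daemon worker, so a check stuck in the client runtime cannot keep the JVM alive
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "queue-empty-poll");
            t.setDaemon(true);
            return t;
        });
        Callable<Boolean> check = isEmpty::getAsBoolean;
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            while (true) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return false;
                }
                Future<Boolean> poll = worker.submit(check);
                try {
                    // Each individual check is bounded by the remaining time budget
                    if (poll.get(remaining, TimeUnit.NANOSECONDS)) {
                        return true;
                    }
                } catch (TimeoutException e) {
                    poll.cancel(true); // best effort; a blocked call may ignore interrupts
                    return false;
                } catch (ExecutionException e) {
                    throw new IllegalStateException("emptiness check failed", e.getCause());
                }
                long sleepMs = Math.min(pollMs,
                        TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime()));
                if (sleepMs > 0) {
                    Thread.sleep(sleepMs);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            worker.shutdownNow();
        }
    }
}
```

In the unit test, the supplier would wrap the original browse-and-close logic, with JMSException rethrown as an unchecked exception because BooleanSupplier cannot throw checked exceptions. This only bounds how long the test waits; it does not get the stranded messages delivered.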

sgjava
Joined: 2005-07-05

OK, it looks as if failover does work, but only after some period of time (it seems like a few minutes). That delay is what I'd like to tune, if possible. If I power off glassfish2, instance2 is taken offline:

member: instance2 of group: cluster1 has failed

After a few minutes, both ORB and direct JMS calls work.

If I power off glassfish1 (the cluster server) and leave glassfish2 up, messages are still not accepted. I'd expect a peer configuration to just work if you lose a server.

amyk
Joined: 2007-09-10

>I started with a *conventional cluster with master broker*, but if you lose the master broker (i.e. power it off), the rest of the instances on the other servers accept messages, but they never reach the MDB (onMessage),

Which jms-service type (EMBEDDED, LOCAL or REMOTE) is used? What is the broker version (shown at the beginning of the broker log when the broker starts up)? Could you please provide more information on your application: how are the messages produced, and from what component? How did you know the other servers accepted the messages? What do the broker logs show? If you run 'imqcmd list bkr' and 'imqcmd list dst' on each broker, what are the outputs?

>If I power off glassfish1 (cluster server) and leave glassfish2 up messages are still not accepted. I'd expect a peer configuration to just work if you lose a server.

What do you mean by "messages are still not accepted"? Did you use 'asadmin configure-jms-cluster' to configure the JMS service cluster? Which JMS cluster type is used? Could you please provide the sections of the broker log from its startup?

The following GlassFish and MQ documentation may be helpful:
GlassFish "Administering the Java Message Service (JMS)"
http://docs.oracle.com/cd/E18930_01/html/821-2416/abljw.html#scrolltoc
GlassFish "Configuring Java Message Service High Availability"
http://docs.oracle.com/cd/E18930_01/html/821-2426/abdbk.html#scrolltoc
MQ "Broker Clusters"
http://docs.oracle.com/cd/E18930_01/html/821-2443/aerdj.html#scrolltoc

sgjava
Joined: 2005-07-05

Here are the configurations I'm testing with GlassFish 3.1.2.2 (using HA-JDBC for the peer and enhanced configurations):

Embedded Conventional Broker Cluster With Master Broker:

    configure-jms-cluster --clustertype=conventional --configstoretype=masterbroker mascluster

Embedded Conventional Broker Cluster of Peer Brokers:

    configure-jms-cluster --configstoretype shareddb --messagestoretype file --clustertype conventional --dbvendor derby --dburl jdbc:ha-jdbc:glassfish --property cluster.sharecc.persist.jdbc.derby.driver=net.sf.hajdbc.sql.Driver mascluster

Local Enhanced Broker Cluster:

    set mascluster.jms-service.type=LOCAL
    configure-jms-cluster --clustertype enhanced --dbvendor derby --dburl jdbc:ha-jdbc:glassfish --property imq.persist.jdbc.derby.driver=net.sf.hajdbc.sql.Driver mascluster

I have a simple echo bean that puts each produced text message on a topic, which the client consumes and verifies. The part that hangs is the QueueBrowser check (hasMoreItem = browser.getEnumeration().hasMoreElements()). This is only used in unit tests, to wait for the queue to empty before reading all the messages on the topic. If I remove this code and use a delay instead, I'm able to read the topic. Basically, I'm trying to test failover with a large number of messages (10000) and make sure they all come full circle through the topic. I realize that only an enhanced cluster allows for guaranteed message delivery, but I want the client to be able to continue sending messages when a broker fails. One thing that helped was catching exceptions in the client and getting a new connection on failure:

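    public final void getNewSender() throws JMSException, NamingException {
        // Close the previous connection (if any) so broken sockets are not leaked
        if (queueConnection != null) {
            try {
                queueConnection.close();
            } catch (JMSException e) {
                log.debug("Ignoring error closing old connection: "
                        + e.getMessage());
            }
        }
        // Connection
        queueConnection = queueConnectionFactory.createQueueConnection();
        // Lookup queue
        queue = jndiLookup(initialContext, "jms/TestQueue");
        // Session (non-transacted, auto-acknowledge)
        queueSession = queueConnection.createQueueSession(false,
                Session.AUTO_ACKNOWLEDGE);
        // Sender
        sender = queueSession.createSender(queue);
    }

        for (int i = 0; i < MESSAGES_TO_SEND; i++) {
            try {
                sender.send(queueSession.createTextMessage(TEST_MSG));
            } catch (JMSException e) {
                // Reconnect and continue; the message that failed is not resent
                log.warn(e.getMessage());
                getNewSender();
            }
            try {
                Thread.sleep(MESSAGES_DELAY);
            } catch (InterruptedException e) {
                // Restore the interrupt flag before propagating
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }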

I haven't been able to find a good code example that handles a failed broker connection in the various GlassFish broker configurations.
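One gap in the send loop above is that the message whose send failed is skipped, so a 10000-message run can come up short after a failover. A minimal sketch of the alternative, resending the same message on a fresh connection with a bounded number of attempts and a linear backoff, is below. MessageSender and Connector are hypothetical stand-ins for the QueueSender and the getNewSender() logic (so the sketch runs without a broker); they are not JMS or MQ types. Note that resending can produce a duplicate if the broker accepted the message before the failure was reported, so the test should tolerate or detect duplicates.

```java
// Hypothetical stand-in for a JMS QueueSender.
interface MessageSender<M> {
    void send(M message) throws Exception;
}

// Hypothetical stand-in for the getNewSender() logic: opens a fresh connection.
interface Connector<M> {
    MessageSender<M> connect() throws Exception;
}

// Sends each message, reconnecting and resending it (instead of skipping it) on failure.
final class ResendingSender<M> {
    private final Connector<M> connector;
    private final int maxAttempts;
    private final long backoffMs;
    private MessageSender<M> sender; // null means "reconnect before the next send"

    ResendingSender(Connector<M> connector, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.connector = connector;
        this.maxAttempts = maxAttempts;
        this.backoffMs = backoffMs;
    }

    /** Returns the number of attempts used; throws if every attempt failed. */
    int send(M message) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (sender == null) {
                    sender = connector.connect();
                }
                sender.send(message);
                return attempt;
            } catch (Exception e) {
                last = e;
                // Drop the possibly broken sender; real code would also close its connection
                sender = null;
                if (attempt < maxAttempts) {
                    pause(backoffMs * attempt); // linear backoff
                }
            }
        }
        throw new IllegalStateException(
                "send failed after " + maxAttempts + " attempts", last);
    }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

With a real broker, the connector would create the connection, session and sender as getNewSender() does, and the message passed in would be the TextMessage (or its text, recreated on each attempt if the old session is gone).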

amyk
Joined: 2007-09-10

When the system running a GlassFish instance is powered off, a connection to its broker may be left half-open, and it can take minutes (depending on system configuration) for the TCP layer to abort that connection. That could explain the failover after "a few minutes". The MQ client runtime can proactively abort a half-open connection if the following connection factory properties are set appropriately:
imqPingAckTimeout (default 0, no timeout)
imqAbortOnPingAckTimeout (default false, no automatic abort connection on ping ack timeout)

Please note that these settings affect connections in every state, not just half-open ones. The ping interval is controlled by the connection factory property imqPingInterval (default 30 seconds). See the com.sun.messaging.ConnectionFactory javadoc for more information on these properties:
http://mq.java.net/javadoc/4.5/javadoc/com/sun/messaging/ConnectionConfi...
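As a rough worked example of how these settings bound the failover delay (an approximation, assuming imqPingInterval is given in seconds and imqPingAckTimeout in milliseconds, as the javadoc describes): after a ping has just gone out, the client may wait up to one full ping interval before the next ping, and then up to the ping-ack timeout before aborting the connection. The hypothetical helper below only does that arithmetic; with the default timeout of 0 there is no ping-based abort, and detection falls back to the TCP layer.

```java
import java.util.OptionalLong;

// Hypothetical helper: rough upper bound on how long a half-open connection
// survives before the MQ client aborts it through the ping mechanism.
final class PingDetection {
    private PingDetection() {
    }

    /**
     * @param pingIntervalSec  imqPingInterval value in seconds (assumed: <= 0 disables pinging)
     * @param pingAckTimeoutMs imqPingAckTimeout value in milliseconds (0 = no timeout)
     * @param abortOnTimeout   imqAbortOnPingAckTimeout value
     * @return estimated worst-case detection time in ms, or empty if pings never abort
     */
    static OptionalLong worstCaseDetectionMs(long pingIntervalSec, long pingAckTimeoutMs,
            boolean abortOnTimeout) {
        if (pingIntervalSec <= 0 || pingAckTimeoutMs <= 0 || !abortOnTimeout) {
            // No ping-based abort: the TCP stack decides, which can take minutes
            return OptionalLong.empty();
        }
        // Up to one full interval until the next ping, plus the ack timeout
        return OptionalLong.of(pingIntervalSec * 1000 + pingAckTimeoutMs);
    }
}
```

With the default 30-second interval and the 5000 ms ping-ack timeout used later in this thread, the estimate is about 35 seconds rather than minutes; a shorter imqPingInterval lowers it further at the cost of more ping traffic.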

>want the client to be able to continue to send messages when a broker fails

If the JMS cluster type is not an enhanced cluster, please make sure reconnectEnabled is set to true on the connection factory used by the sender:
http://docs.oracle.com/cd/E18930_01/html/821-2438/aeoop.html#scrolltoc

sgjava
Joined: 2005-07-05

OK, creating the connection factory directly (instead of looking it up over the ORB), I applied the settings below with a conventional cluster (they should work with standalone instances as well). This allowed sending messages without any exception being thrown; there were only warnings after I powered off the master broker partway through the run:

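        // Reconnect automatically when the connection to a broker is lost
        connectionFactory.setProperty(
                ConnectionConfiguration.imqReconnectEnabled, "True");
        // Try reconnecting once, waiting 5000 ms between attempts
        connectionFactory.setProperty(
                ConnectionConfiguration.imqReconnectAttempts, "1");
        connectionFactory.setProperty(
                ConnectionConfiguration.imqReconnectInterval, "5000");
        // Stop waiting for a broker acknowledgement after 5000 ms
        connectionFactory.setProperty(
                ConnectionConfiguration.imqAckTimeout, "5000");
        // Abort the connection if a ping is not acknowledged within 5000 ms
        connectionFactory.setProperty(
                ConnectionConfiguration.imqPingAckTimeout, "5000");
        connectionFactory.setProperty(
                ConnectionConfiguration.imqAbortOnPingAckTimeout, "True");
        // Give up on a socket connect attempt after 5000 ms
        connectionFactory.setProperty(
                ConnectionConfiguration.imqSocketConnectTimeout, "5000");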

15:32:50.155 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - directJms
15:32:50.275 [main] WARN javax.jms - [C4003]: Error occurred on connection creation [glassfish1.corp.local:27676]. - cause: java.net.ConnectException: Connection refused
15:32:55.585 [main] WARN javax.jms - [C4003]: Error occurred on connection creation [glassfish1.corp.local:27676]. - cause: java.net.ConnectException: Connection refused
15:33:00.701 [main] WARN javax.jms - [C4003]: Error occurred on connection creation [glassfish1.corp.local:27676]. - cause: java.net.ConnectException: Connection refused
15:33:05.753 [main] WARN javax.jms - [C4003]: Error occurred on connection creation [glassfish1.corp.local:27676]. - cause: java.net.ConnectException: Connection refused
15:33:10.832 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - Sending 100 messages
15:33:11.466 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - Elapsed time: 627 ms, messages: 100, average: 6.270000 ms
15:33:11.466 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - waitForEmptyQueue
15:33:12.480 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - Elapsed time: 1014 ms
15:33:12.501 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - waitForEmptyTopic
15:33:12.501 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - Waiting for 100 messages to be received
15:33:14.262 [main] DEBUG c.bhn.services.jms.ClusterDirectTest - Elapsed time: 1760 ms, messages: 100, average: 17.600000 ms
15:33:14.291 [main] DEBUG com.bhn.services.jms.ClusterTest - tearDown
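The run above only confirms that 100 messages came back. Because the reconnect loop can skip a message whose send failed, a count alone cannot show which messages were lost or duplicated across a failover. A minimal sketch, assuming each sent message carries a sequence number (this numbering is an assumption; the original test sends the same TEST_MSG every time), collects the echoed numbers with a deadline and reports gaps and duplicates. EchoTracker is a hypothetical name, and a BlockingQueue stands in for the topic subscriber.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical test helper: checks that sequence numbers 0..expected-1
// all came back through the topic before a deadline.
final class EchoTracker {
    private final BitSet seen = new BitSet();
    private final List<Integer> duplicates = new ArrayList<>();

    /** Drains received sequence numbers until all are seen or the deadline passes. */
    void collect(BlockingQueue<Integer> received, int expected, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            while (seen.cardinality() < expected) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return; // give up; missing() reports the gaps
                }
                Integer seq = received.poll(remaining, TimeUnit.NANOSECONDS);
                if (seq == null) {
                    return; // timed out waiting for the next echo
                }
                if (seq < 0 || seq >= expected) {
                    continue; // not part of this run
                }
                if (seen.get(seq)) {
                    duplicates.add(seq); // e.g. a resend after an ambiguous failure
                } else {
                    seen.set(seq);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Sequence numbers in [0, expected) that never arrived. */
    List<Integer> missing(int expected) {
        List<Integer> gaps = new ArrayList<>();
        for (int i = seen.nextClearBit(0); i < expected; i = seen.nextClearBit(i + 1)) {
            gaps.add(i);
        }
        return gaps;
    }

    List<Integer> duplicates() {
        return duplicates;
    }
}
```

In the real test, the topic listener's onMessage would parse the sequence number from each echoed message and put it on the queue passed to collect().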