Glassfish 3.1 Final - High Availability Web Apps Slow, and loses session state

36 replies
lprimak
Offline
Joined: 2006-08-22

Hi,

I set up a cluster and deployed my JSP application onto it. It works great until I turn on high availability for this application via the Admin console. Once I do that, it becomes very slow, and session state gets lost every 2 requests or so. Disabling high availability cures the problem.

I ran asadmin validate-multicast, GMS is running, cluster health is good, I followed the documentation, and I didn't do anything 'weird' or 'customized'. I also have <distributable/> in my web application. There are no errors in the log files.

When I turn on high availability, I do get this warning very frequently:

[#|2011-03-06T02:13:00.297-0500|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=27;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

Can somebody help me with this? Is it a bug?

Thanks,
Lenny

yhjhoo
Offline
Joined: 2010-12-06

Did you solve this problem? I have had the same issue for weeks.

I followed this blog http://tiainen.sertik.net/2011/03/load-balancing-with-glassfish-31-and.html. But no luck!

shreedhar_ganapathy
Offline
Joined: 2007-01-17

Hi yhjhoo
We need specifics from you as well - as there are multiple posters on
this thread - each one may have a different setup making it hard for us
to tell if there are multiple issues involved.

So here's my request to all folks facing issues - please file JIRA
issues separately for each of your cases and add specific steps to the
issue.

If these are duplicates, we can always mark them as such.

In addition to details of the issues you see, we also need information
on GlassFish version, JDK version, OS, Application used, and specific
steps to reproduce.

Let's go from here and see if we can get this resolved soon.

Thanks
Shreedhar

lprimak
Offline
Joined: 2006-08-22

Great news! It looks like this problem is fixed in GF 3.1.2!

lprimak
Offline
Joined: 2006-08-22

Sorry all, spoke too soon...

I just tried the 3.1.2b20 release, and while the performance is good now and the TimeoutExceptions are gone, session state replication does not work correctly.

It seems the node with the older data is overwriting the node with the newer data:

- a session attribute gets added to node1

- another session attribute gets added to node2

- node1 all of a sudden only sees attribute 2
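
The three steps above can be replayed in a few lines. This is only a minimal sketch of the symptom, assuming the replication step replaces the receiver's whole attribute map rather than merging it. The class and method names are hypothetical and nothing here is taken from the Shoal or GlassFish code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StaleOverwriteSketch {

    /** Hypothetical whole-session copy: the receiver's attributes are replaced, not merged. */
    public static void overwrite(Map<String, String> sender, Map<String, String> receiver) {
        receiver.clear();
        receiver.putAll(sender);
    }

    /** Replays the three steps from the post and returns the attribute names node1 ends up with. */
    public static Set<String> node1AttributesAfterReplay() {
        Map<String, String> node1 = new HashMap<>();
        Map<String, String> node2 = new HashMap<>();

        node1.put("attr1", "added on node1"); // step 1: attribute added on node1
        node2.put("attr2", "added on node2"); // step 2: node2 never saw attr1
        overwrite(node2, node1);              // step 3: node2's older copy replaces node1's

        return new TreeSet<>(node1.keySet());
    }

    public static void main(String[] args) {
        System.out.println(node1AttributesAfterReplay());
    }
}
```

After the replay node1 holds only the attribute that was written on node2, which matches the behaviour described in the list.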

lprimak
Offline
Joined: 2006-08-22

This problem is finally fixed with GF 3.1.2.2 and the relaxCacheVersionSemantics=true directive.

AJAX failover has issues but I don't believe they are related to this particular error.
Thanks for all your help!

lprimak
Offline
Joined: 2006-08-22

Has there been any progress on this since the release of GF 3.1.1?

I have a very standard setup and I can't imagine no one else is running into this issue. It seems pretty serious to me.

Thanks

shreedhar_ganapathy
Offline
Joined: 2007-01-17

Hi,

Are you seeing these slowness problems with GF 3.1.1 or with GF 3.1?

Could you share your environment details such as JDK version, OS, Heap
settings, CPU and RAM?

Could you first try deploying the ClusterJSP sample to your cluster
with availability enabled, to see if this sort of slowness exists with
that app? This will set a baseline for comparisons since we use
clusterjsp for some of our tests as a starting point.

We will take a look into your logs in the meantime.

hth
Shreedhar

lprimak
Offline
Joined: 2006-08-22

I am observing the exact same problem in the GF 3.1.1 release

The app is incredibly slow and losing sessions. If 'Enable Availability' is off, everything works great.

I also have <distributable/> in the web.xml

Thanks

mk111283
Offline
Joined: 2005-03-29

Looking into this. I have tested the replication module by having 16
threads trying to write and read data with asyncreplication set to
false. In that scenario, I DID NOT see any data loss or degradation in
performance.

I am about to test the replication module using JMeter and cluster.jsp
app. Will post the response in the next couple of days.

Thanks,
--Mahesh

lprimak
Offline
Joined: 2006-08-22

Here is what I did to reproduce this:

Deploy clusterjsp.war example on a cluster with 2 nodes with the following modifications:

- add distributable directive into WEB-INF/web.xml

- remove sun-web.xml

- put random values into the session, about 5k worth.

After a while, the session gets lost with the following error:

[#|2011-10-10T20:36:09.789-0400|WARNING|glassfish3.1.1|org.shoal.ha.cache.command.load_request|_ThreadID=33;_ThreadName=Thread-2;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]
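
To count how often this warning shows up in a server.log, something like the following could be used. It is a rough sketch that assumes each record sits on a single line and follows the pipe-delimited layout of the record quoted above (`[#|timestamp|level|product|logger|thread info|message|#]`). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class ShoalTimeoutScan {

    // Logger name copied from the warning quoted in the post.
    static final String LOAD_REQUEST_LOGGER = "org.shoal.ha.cache.command.load_request";

    /** True if the record is a load_request warning that mentions a TimeoutException. */
    public static boolean isLoadRequestTimeout(String record) {
        String[] fields = record.split("\\|", -1);
        // Expected: "[#", timestamp, level, product, logger, thread info, message, "#]"
        if (fields.length < 8 || !fields[0].equals("[#")) {
            return false;
        }
        String logger = fields[4];
        String message = fields[6];
        return LOAD_REQUEST_LOGGER.equals(logger) && message.contains("TimeoutException");
    }

    /** Counts matching records in a list of single-line log records. */
    public static long countLoadRequestTimeouts(List<String> records) {
        long count = 0;
        for (String record : records) {
            if (isLoadRequestTimeout(record)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "[#|2011-10-10T20:36:09.789-0400|WARNING|glassfish3.1.1|org.shoal.ha.cache.command.load_request|_ThreadID=33;_ThreadName=Thread-2;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]");
        System.out.println(countLoadRequestTimeouts(sample));
    }
}
```

Records that span several lines (for example ones carrying a stack trace) would need to be joined before being passed in.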

lprimak
Offline
Joined: 2006-08-22

To add to the following,

it seems that the issue crops up when the session gets large, roughly larger than 1k, and is updated frequently.

mk111283
Offline
Joined: 2005-03-29

Thanks for the info. Can you confirm that the issue doesn't arise if the session size is less than 1k?

For session sizes greater than 1k there could be some GMS / Grizzly tuning required. Will check with the GMS / Grizzly team about this.

Thanks,
--Mahesh

lprimak
Offline
Joined: 2006-08-22

I am not sure about the <1k issue, but I can't reproduce it right now with small sessions.

Can't really tell you it's gone, just can't reproduce :)

Thanks for your help Mahesh!

lprimak
Offline
Joined: 2006-08-22

I've noticed something else odd.

We have the server configured behind an Apache cluster using the AJP protocol.

We also use sticky-session load balancing:

<Location /clusterjsp>
# ProxyPass balancer://ajpStageCluster/clusterjsp stickysession=JSESSIONID
ProxyPass balancer://ajpStageHACluster/clusterjsp
</Location>

When we disable sticky sessions (as shown above), the session gets lost only if it is big, but if we uncomment the first line and use sticky sessions, the app becomes unresponsive and loses sessions all the time. The sticky session cookie rewriting seems to have an adverse impact when availability is enabled.

mk111283
Offline
Joined: 2005-03-29

That seems like a web container / Load balancer issue. I would expect session losses to be close to zero when sessions are sticky. Can you send us the HTTP response message (including headers, cookies etc.).

Anyway, I will test it myself with Jmeter before suspecting other components.

--Mahesh

lprimak
Offline
Joined: 2006-08-22

Well, the same exact load balancer configuration works perfectly with Glassfish availability turned off.

It should be the other way around :) Sticky sessions are there because session replication is not :)

I can reproduce the error with small sessions this way. If the sticky sessions are turned off, it's very hard to reproduce the timeout and session loss.

BTW the session loss happens EVERY time the TimeoutException happens.

I wouldn't put the blame on the load balancer, but perhaps the combination of it, sticky sessions, JK support in Glassfish and Shoal helps tease this bug to the surface.

I attached the log files from Shoal debugging again.

What I did was load a single page, with no session, through the sticky-session load balancer and the 'available' web app.

You get tons of timeout exceptions from Shoal. Someone from the Shoal team should be able to parse these, I hope.

Thanks again for your help Mahesh!

If there is trouble uploading the logs, they are at http://hope.nyc.ny.us/~lprimak/files/server_logs.zip

mk111283
Offline
Joined: 2005-03-29

Looked into server-bawweb4.log.

The following set of log entries show that a load request for version 6 arrives before the save for version 6 arrives!!
This is only possible if you deployed the app with asyncreplication=true (the default). Can you confirm that you deployed the app with asyncreplication=true?

[#|2011-10-10T23:42:02.633-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command|_ThreadID=23;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.interceptor.CommandHandlerInterceptor;MethodName=onReceive;|/stage: Received LoadRequestCommand:35(11362da14db1da505ec8bfba8654) from bawweb3-inst|#]

[#|2011-10-10T23:42:02.633-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command.load_request|_ThreadID=23;_ThreadName=Thread-2;ClassName=org.shoal.adapter.store.commands.LoadRequestCommand;MethodName=execute;|bawweb4-instLoadRequestCommand:35 received load_request command for 11362da14db1da505ec8bfba8654from bawweb3-inst|#]

[#|2011-10-10T23:42:02.634-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command.save|_ThreadID=23;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.store.SimpleStoreableDataStoreEntryUpdater;MethodName=createLoadResponseCommand;|SimpleStoreableDataStoreEntryUpdater.createLoadResp 5 >= 6; rawV.length = [B@1e96744|#]

[#|2011-10-10T23:42:02.748-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.stats|_ThreadID=29;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.interceptor.ReplicationCommandTransmitterWithMap$BatchedCommandMapDataFrame;MethodName=flushAndTransmit;|flushAndTransmit will flush data because lastTS = 1318304522634; timeStamp = 1318304522634; lastTS = 1318304522634; map.size() = 1; removedKeys.size() = 0|#]

[#|2011-10-10T23:42:02.749-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.stats|_ThreadID=29;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.interceptor.ReplicationCommandTransmitterWithMap$BatchedCommandMapDataFrame;MethodName=doAddOrRemove;|doAddOrRemove batchThresholdReached.get()=true; inFlightCount = 0; |#]

[#|2011-10-10T23:42:02.749-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.stats|_ThreadID=29;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.interceptor.ReplicationCommandTransmitterWithMap$BatchedCommandMapDataFrame;MethodName=doAddOrRemove;|Sending batch# 6 to bawweb3-inst; wasActive for (115 millis|#]

[#|2011-10-10T23:42:02.750-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command.load_response|_ThreadID=35;_ThreadName=Thread-2;ClassName=org.shoal.adapter.store.commands.LoadResponseCommand;MethodName=writeObject;|bawweb4-instLoadResponseCommand:37 sending load_response command for 11362da14db1da505ec8bfba8654 to bawweb3-inst; version = -9223372036854775808; state = NOT_FOUND|#]

…..
……
[#|2011-10-10T23:42:02.933-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command|_ThreadID=24;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.interceptor.CommandHandlerInterceptor;MethodName=onReceive;|/stage: Received SaveCommand:33(11362da14db1da505ec8bfba8654) from bawweb3-inst|#]

[#|2011-10-10T23:42:02.934-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command.save|_ThreadID=24;_ThreadName=Thread-2;ClassName=org.shoal.adapter.store.commands.SaveCommand;MethodName=execute;|/stageSaveCommand:33 received save_command for key = 11362da14db1da505ec8bfba8654 from bawweb3-inst|#]

[#|2011-10-10T23:42:02.934-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command.save|_ThreadID=24;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.store.SimpleStoreableDataStoreEntryUpdater;MethodName=executeSave;|SimpleStoreableDataStoreEntryUpdater.executeSave. SAVING ... entry = org.shoal.ha.cache.impl.store.DataStoreEntry@2f7d55; entry.version = 5; cmd.version = 6|#]

[#|2011-10-10T23:42:02.934-0400|FINE|glassfish3.1.1|org.shoal.ha.cache.command.save|_ThreadID=24;_ThreadName=Thread-2;ClassName=org.shoal.ha.cache.impl.store.DataStoreEntryUpdater;MethodName=printEntryInfo;|executeSave:SimpleStoreableDataStoreEntryUpdater:Updated key = 11362da14db1da505ec8bfba8654; entry.version = 6 ; entry.lastAccess = 1318304580115; entry.maxIdle = 7200000|#]

Again, I don't understand why the web container is performing a load when the requests are sticky. Will talk to the web container team and will let you know.

--Mahesh
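
The ordering problem in these log entries (the replica still holds version 5 when a load for version 6 arrives, answers NOT_FOUND, and only afterwards receives the save for version 6) can be modelled roughly as below. This is a sketch under stated assumptions, not the Shoal implementation: the class, its methods and the `relaxed` flag are hypothetical, and only the `Long.MIN_VALUE` sentinel is taken from the `version = -9223372036854775808; state = NOT_FOUND` entry above.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionedLoadSketch {

    /** Sentinel seen in the load_response log entry (Long.MIN_VALUE). */
    public static final long NOT_FOUND = Long.MIN_VALUE;

    // Hypothetical replica state: session key -> highest version saved so far.
    private final Map<String, Long> storedVersion = new HashMap<>();

    /** Applies a save only if it carries a newer version than the stored one. */
    public void save(String key, long version) {
        Long current = storedVersion.get(key);
        if (current == null || version > current) {
            storedVersion.put(key, version);
        }
    }

    /**
     * Answers a load request for at least minVersion. With relaxed == false the
     * replica refuses to hand out an older version. With relaxed == true it
     * returns whatever version it holds (an assumed reading of what a relaxed
     * version check would do).
     */
    public long load(String key, long minVersion, boolean relaxed) {
        Long current = storedVersion.get(key);
        if (current == null) {
            return NOT_FOUND;
        }
        if (current >= minVersion || relaxed) {
            return current;
        }
        return NOT_FOUND;
    }

    public static void main(String[] args) {
        VersionedLoadSketch replica = new VersionedLoadSketch();
        replica.save("11362da14db1da505ec8bfba8654", 5);
        // The load for version 6 arrives before the save for version 6.
        System.out.println(replica.load("11362da14db1da505ec8bfba8654", 6, false));
        replica.save("11362da14db1da505ec8bfba8654", 6);
        System.out.println(replica.load("11362da14db1da505ec8bfba8654", 6, false));
    }
}
```

In this model the strict check turns a late save into a lost session for the requester, while the relaxed check would return the slightly stale version 5 instead.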

lprimak
Offline
Joined: 2006-08-22

Any progress on this? There is none on my end despite best efforts :)

Thanks again!

Braswell, Steph...
Offline
Joined: 2011-06-27

My group has experienced similar issues. The HA replication doesn't work like it does in our 2.1 environments. We spent some considerable time trying to troubleshoot and resolve the problem with no luck. I even tried some radical things like upgrading the shoal and grizzly JARs to later versions but that didn't help. We opened a ticket with Oracle last month but don't have a resolution yet.

-Stephen

shreedhar_ganapathy
Offline
Joined: 2007-01-17

Hi Stephen
I am assuming that the issue you are reporting is the same one our team
has been working on to try to reproduce - in terms of loss of sessions
but not in terms of slowness.

This forum post seems to point both to slowness and partial loss of
session. So we need to know which one of these is the issue in your case.

The team has been trying to reproduce the scenario of session loss, and
multiple attempts to do so have not yielded the loss of sessions scenario.

What we need from you is exact steps to reproduce, including the following:
1. The app used - if it's clusterjsp we need the version of the app you
have used, so having the war/ear file you used is useful.
2. Any specific descriptor settings you have changed - if you have used
defaults let us know - for instance, which of the following session
scopes are you using? Full session, modified session or modified
attribute?
3. The command you used to deploy the application - for instance, did
you deploy with availabilityenabled=true, and did you do so while
deploying or after deployment? Did you specify asyncreplication=false?
If you used the admin console, let us know as well.
4. Are you setting the relaxVersionSemantics property for an app that
uses Ajax-like constructs?
5. When do the session losses occur - every request, occasionally,
deterministically?
6. What is your session size? Is it very large? 2k, 10k, 200k, 200
mb.....?
7. How many instances do you have in your cluster? Are these on
physical nodes or on virtual machines?
8. What is the user request traffic rate? IOW, how many concurrent
users do you use in your tests and at what rate are sessions being
changed? Are you failing any instances in your test scenario?
9. Are you using an LB - let us know which one and if you are using
sticky sessions?

Thanks
Shreedhar

lprimak
Offline
Joined: 2006-08-22

- Both session loss and slowness coincide directly with the TimeoutException

- The app used is our internal app; we are having trouble reproducing this with cluster.jsp directly

- aside from the <distributable/> directive, there is no tuning in web.xml, and there is no glassfish-web.xml at all

- the application is deployed from the Admin GUI, with no changes in any of the checkboxes aside from 'availability'

- availability is set at deployment time, not after

- no relaxVersionSemantics property

- session loss occurs frequently but not always, and there is always a Shoal TimeoutException in the logs that corresponds to the session loss

- session size is around 50k

- cluster has 2 nodes, both are full (not virtual) machines

- There is no traffic (test server), just me trying to use the app with one browser

- The issue happens whether you use a load balancer or not, even when hitting the server directly, although it's much easier to reproduce with a sticky-session load balancer (apache/mod-proxy-ajp)

Thank you!

lprimak
Offline
Joined: 2006-08-22

A slight update to this:

I _CAN_ reproduce the shoal TimeoutException with cluster.jsp, just not session loss.

This leads me to believe that if the TimeoutException is fixed, the session loss will be too.

Also, it might not be a complete session loss, just corruption. The key is to fix the TimeoutException, I believe, and the problems will go away.

It also might be that cluster.jsp accesses the session once, whereas real apps access it in parallel from many (linked) pages simultaneously, even if only one user is browsing the web site. Something to think about.

lprimak
Offline
Joined: 2006-08-22

The clocks on the machines are about one minute apart. Could this be the cause? Unfortunately I can't run NTP on them.

I did some more tests, and the sticky sessions have nothing to do with this problem.

It turns out that taking the sticky sessions away alleviated the problem only because the page load then fetched its components from both servers simultaneously, thus ruining the integrity of the test.

I ran another test connecting to Glassfish directly through https, and the results were as bad as through a load balancer.

lprimak
Offline
Joined: 2006-08-22

I set the clocks on the machines to within a second of each other and I still have the problem :(

lprimak
Offline
Joined: 2006-08-22

Also, if the session is particularly large, I get this in the log file:

server.log_2011-10-10T23-35-19:[#|2011-10-10T20:18:52.715-0400|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=47;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 105,888 exceeds max multicast size 65,536|#]
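
A quick way to see whether a session is in this size range is to serialize its contents and compare the byte count with the 65,536 limit quoted in the GMS1073 warning. The sketch below uses plain Java serialization. The class and method names are hypothetical, and whether the datagram size tracks the serialized session size one-to-one is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

public class SessionSizeSketch {

    /** Limit taken from the GMS1073 warning text (65,536 bytes). */
    public static final int MAX_MULTICAST_BYTES = 65_536;

    /** Returns the number of bytes the value occupies after Java serialization. */
    public static int serializedSize(Serializable value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** True if the serialized form would not fit under the quoted multicast limit. */
    public static boolean exceedsMulticastLimit(Serializable value) {
        return serializedSize(value) > MAX_MULTICAST_BYTES;
    }

    public static void main(String[] args) {
        // Stand-in for session attributes; a real test would copy them out of the HttpSession.
        HashMap<String, byte[]> attributes = new HashMap<>();
        attributes.put("payload", new byte[105_888]);
        System.out.println(serializedSize(attributes) + " bytes, over limit: "
                + exceedsMulticastLimit(attributes));
    }
}
```

Running this against a copy of the real session attributes gives a rough idea of how far a session is from the size at which the warning appears.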

lprimak
Offline
Joined: 2006-08-22

Thanks for looking at this, Mahesh,

I deployed the app using the Admin console, so if async=true is the default, then that's what I did :)

justin.gronfur
Offline
Joined: 2011-03-30

I have observed this on both 3.1 final and 3.1.1 RC2 (build 12).

I am running CentOS 5.4 64-bit with Java 1.6u26 64-bit. I have tried the ClusterJSP sample app and that works fine.

During debugging I have tried switching to synchronous replication and decreased the broadcast delay, and that doesn't seem to resolve the problem at all.

Given that it is a TimeoutException, I'm wondering if it is caused by having too much information contained within a user session. We use Spring extensively, with Spring Security for authentication, and we use session and global-session scoped beans to store session information, which consists of several serialized Java objects that contain much more data than a typical web session.

Any further debugging steps or advice is appreciated.

Thanks!

justin.gronfur
Offline
Joined: 2011-03-30

Does anyone have an answer to this? I'm getting this same problem and can't find anything out about it.
I usually see it close to the following error as well, but I don't know if they are related.
[#|2011-04-01T22:36:04.743-0500|WARNING|glassfish3.1|javax.enterprise.system.container.web.org.glassfish.web.ha.session.management|_ThreadID=56;_ThreadName=Thread-1;|Exception occurred in getSession
java.io.StreamCorruptedException: invalid type code: 00

ai109478
Offline
Joined: 2005-03-29

[Posting this message on behalf of Mahesh]
Could you do the following?

asadmin set-log-levels --target <your-cluster-name> org.shoal.ha=FINE

then restart your cluster and domain.

Then run your app and post all the server.log(s)

lprimak
Offline
Joined: 2006-08-22

I did that. There are about 1.5M worth of logs.
I zipped them up and put them on my server at http://hope.nyc.ny.us/~lprimak/files/server_logs.zip
Thanks for looking into this. It's very easy to reproduce.
Lenny

lprimak
Offline
Joined: 2006-08-22

Just FYI, I still have no solution to this :(

lprimak
Offline
Joined: 2006-08-22

I just realized that the server wasn't running that hosts the logs.
It is running now! Thank you!

mk111283
Offline
Joined: 2005-03-29

Took a quick look at the server.logs.

There are two reasons which could lead to slow replication: 1. slow serialization, and 2. too many TimeoutExceptions (due to too many load requests).

The time to serialize your application data could be long if it involves objects/classes from many OSGi modules. One way to find out is to set web container logging to FINE.

Regarding the TimeoutException, I made one interesting observation: I see a bunch of load requests for a particular sessionID: F8088F2A1F93F2347AD29CE580CE6CE1

I couldn't find any save messages for the above sessionID.

Also, use async replication to get better performance. Synchronous replication is slower than async replication. BTW, what is the typical size of your app's HTTP session data?

lprimak
Offline
Joined: 2006-08-22

My application is very simple. The only thing in the session is a login ID and some tokens (less than 1k worth) of basically string values. There is no heavy load whatsoever. It's just one user (me) trying to test it out with a browser.

This is not an esoteric case. I get the same results with the ClusterJSP example as well.

Thanks for your help! Lenny