Loadbalancer problems with Apache

ocoro02

I'm trying to use the Glassfish LB (mod_loadbalancer.so) with Apache. I'm running into several problems and wondering whether anyone else has seen them - I can't find anything that looks likely on the Glassfish issues list. I've seen these issues with multiple versions of the aslb, even the very old version shipped with Sun's 8.2 Enterprise Edition (with the 8.2 appserver).

Firstly - any request to Apache results in the load balancer opening connections to the app server HTTP listeners, regardless of whether the request is mapped in loadbalancer.xml or not. So a request to Apache that returns a 404 to the client will still open network connections to the backends. This is a particular problem for me because I have frontend h/w loadbalancers that send probes to the Apache tier, and each probe causes mod_loadbalancer to generate requests to the backends (which ideally shouldn't happen ...).

Secondly - I'm seeing a cascade effect of network connections with netstat. The first connection from Apache to the app servers will be established. It'll then go into TIME_WAIT, and another connection is initiated. This doesn't seem to increase in a linear fashion - the rate of new connections being created seems to increase almost exponentially. There is a ceiling however - for instance with one single http request (this was a 404 request as in my first point above), it maxes out at 14 connections for some reason. This state is maintained - established conns go to TIME_WAIT and others drop off - so it stays at 14 connections forever! This isn't very desirable and eventually results in the web server hosts running out of network resources.
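A quick way to tally those connection states from the web server (plain netstat/awk, nothing GlassFish-specific; 38080 is the listener port in my setup):

netstat -an | grep 38080 | awk '{print $NF}' | sort | uniq -c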

I'm using Glassfish v2 final (b58g), various versions of the aslb (current aslb-9.1-MS4-b7), Apache 2.0.59 with patches or 2.0.61, Solaris 10 x86 frontend webservers, Windows Server 2003 R2 appservers (I know, odd choice of OS).

netstat -a from webserver:

10.0.1.28.59057 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59058 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59059 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59060 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59061 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59062 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59063 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59064 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59065 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59066 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59067 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59068 10.0.1.140.38080 65535 0 49640 0 TIME_WAIT
10.0.1.28.59069 10.0.1.140.38080 16384 0 49640 0 ESTABLISHED

On the appserver, I only see a max of two connections - one ESTABLISHED, and sometimes one in TIME_WAIT. Could this be a misconfig of the HTTP listener threads on the app server, perhaps?
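If it helps anyone check the same thing, the listener thread and keep-alive settings can be dumped with asadmin - a sketch assuming GlassFish v2's dotted names and a cluster config named cluster1-config, so adjust for your own setup:

asadmin get 'cluster1-config.http-service.request-processing.*'
asadmin get 'cluster1-config.http-service.keep-alive.*'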

Pankaj Jairath

The HTTP Load Balancer runs implicit health checks against the listeners of
the participating server instances in the cluster. These begin at
webserver/LB start-up and continue over its life cycle, which produces
the behaviour described here. The following issue addresses this
- https://glassfish.dev.java.net/issues/show_bug.cgi?id=3756 .
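For reference, the health checks are driven by the health-checker element in loadbalancer.xml; a minimal illustrative example (values here are typical defaults, not taken from this thread):

<health-checker url="/" interval-in-seconds="30" timeout-in-seconds="10"/>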

regards
Pankaj


ocoro02

Hi - I've done some tracing with ethereal and it's a little bit clearer.

Conn#1 opened - Web server with mod_loadbalancer initiates connection with app server

30 seconds passes - this is the value defined in HTTP Service -> Keep Alive -> Time Out (I tried varying it).

Conn#1 closed - App server sends FIN,ACK to web server. Web server sends FIN,ACK back and connection is closed cleanly. It's marked TIME_WAIT in netstat on the web server.

Conn#2 opened - Web server initiates a new connection to app server.

5 seconds passes with no traffic

Web server sends a FIN,ACK to the app server for conn#2

Conn#3 opened - Web server initiates a new connection with app server (before receiving the app server's FIN,ACK for the previous conn#2).

Conn#2 closed - FIN,ACK received from app server - this terminates the connection and it goes to TIME_WAIT on the web server.

5 seconds passes

Web server sends FIN,ACK to app server for conn#3

Web server initiates conn#4 without waiting for FIN,ACK from conn#3

App server sends back FIN,ACK for conn#3 - it closes and goes to TIME_WAIT

5 seconds passes ...

and on and on it goes.

So the first connection times out as I'd expect, but from then on mod_loadbalancer initiates a new connection every 5 seconds - forever. I don't know where this value of 5 seconds comes from or why it's doing it.
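For anyone chasing that 5-second figure, the candidate timeouts I can think of can be inspected like this (a sketch - the asadmin dotted name assumes a v2 cluster config called cluster1-config, and neither value is confirmed to be the source of the 5s):

asadmin get 'cluster1-config.http-service.keep-alive.timeout-in-seconds'
egrep -i 'health-checker|timeout' loadbalancer.xml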

ocoro02

The problem occurs whether the load balancer connects directly to Glassfish's HTTP listener or to an Apache server proxying to Glassfish's HTTP listener. So it doesn't look like a problem with the Glassfish listener itself.
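For the proxy variant of that test, the intermediate Apache stanza would be along these lines (illustrative - the exact config isn't shown in the thread; the address is the backend from the netstat output above):

ProxyPass / http://10.0.1.140:38080/
ProxyPassReverse / http://10.0.1.140:38080/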

ocoro02

It's worth pointing out that the load balancer works fine for a short while, until the web server becomes overwhelmed by network conns to the app servers in TIME_WAIT/ESTABLISHED. It typically stops working after maybe a hundred requests, which equates to perhaps 1400 connections showing in netstat. I've seen it get to tens of thousands, which makes the web server pretty difficult to even log in to!
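As a stopgap on the Solaris web servers, the TIME_WAIT interval can be inspected and (cautiously) lowered with ndd - this only masks the symptom, it doesn't fix the plugin's reconnect behaviour:

ndd -get /dev/tcp tcp_time_wait_interval
ndd -set /dev/tcp tcp_time_wait_interval 30000

(values are in milliseconds; the Solaris 10 default is 60000)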

vr143562

Can you tell us how you are checking that connections have been opened from Apache to the appserver backend?

I will try and get another pair of eyes to look at this issue.

thanks.

vr143562

sorry, forgot that you had mentioned netstat in your previous entry.

thanks.

vr143562

Hi:

Can you share your httpd.conf file's prefork MPM settings? Look for "prefork".
By default for Apache 2.0.59 they look like this:


StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 150
MaxRequestsPerChild 0


Can you change the StartServers and the MaxClients settings to 1, try your tests again and share the observations?
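i.e. something like this, with the spare-server settings pinned down as well so Apache can't fork extra children:

StartServers 1
MinSpareServers 1
MaxSpareServers 1
MaxClients 1
MaxRequestsPerChild 0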

Here's a possible explanation of what you are observing:
- The LB Plugin opens network connections to each of the listeners specified in loadbalancer.xml.
- These network connections are established when Apache receives the first client request after a restart.
- Apache spawns a number of processes to listen for client requests. This number is equal to the "StartServers" value specified in the prefork settings.
- I think each of these processes has its own instance of the LB Plugin, and each establishes its own network connections to the backend.
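A quick way to confirm how many children (and hence plugin instances) are running during the test:

ps -ef | grep '[h]ttpd' | wc -l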

Let's hope the above test shows a difference that will help us understand the workings better.

thanks.

varun.

ocoro02

Hi Varun - StartServers & MaxClients are already set to 1 and I'm running Apache in the default prefork mode (double-checked with 'httpd -l'). I tried to make sure I installed everything to the letter of the instructions ... The netstat results I'm seeing are with just one listener defined. My loadbalancer.xml was slightly modified from the one generated from within Glassfish - I've tried varying each of the variables one at a time, with no luck. I also tried port 8080 as opposed to 38080 (the default for the cluster). Each time I tried different LB versions I made sure all the libs, dtds and res files were updated too. I've double-checked that /usr/lib/mps is set in LD_LIBRARY_PATH and that the libs in there make sense.
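Roughly, those checks amount to the following (the module path here is illustrative - yours may differ):

echo $LD_LIBRARY_PATH
ls /usr/lib/mps
ldd /usr/local/apache2/modules/mod_loadbalancer.so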

I'm slightly suspicious of some odd compatibility issue between Solaris and Windows socket settings - but I've also tried using a Linux (Fedora) or a Solaris backend, with no luck.

[loadbalancer.xml snippet stripped by the forum's HTML filter; only the fragment t="100"/> survived]


kshitiz_saxena

As soon as Apache gets a request, it will initialize the load balancer subsystem if the aslb plugin is installed. This opens a connection to the appserver to check its status, and keeps that connection open (to detect instance failure) - hence the one established connection you see at the appserver. This all happens as part of initialization, and only after initializing can the plugin figure out whether the request should be handled by it or not. So the connections are created even when the request doesn't need to be serviced by the load balancer.

As for the other connections in TIME_WAIT state, please remove the aslb-lbplugin from Apache and check how many connections are opened at startup. This may not be related to the aslb-lbplugin at all.
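i.e. comment out the plugin's LoadModule line in httpd.conf and restart Apache before re-checking netstat (the module identifier below is illustrative - use whatever line your aslb install added):

# LoadModule apachelbplugin_module modules/mod_loadbalancer.so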

Thanks,
Kshitiz

ocoro02

Hiya - yep, it's definitely the plugin; I tried without the plugin installed and saw no open conns. The conns are all to the defined listener ports on the app servers.

The one ESTABLISHED connection seems to survive for around 10 seconds before going into TIME_WAIT and being replaced by another ESTABLISHED connection. For a single request it reaches a steady state of conns in ESTABLISHED or TIME_WAIT, as the older waiting conns drop off and are replaced each time a new connection is established. With apache configured with startservers=1 & maxclients=1, this seems to be 13 conns in TIME_WAIT and 1 ESTABLISHED at any one time.

Could this be due to the app server terminating the connection after 10 secs, and the load balancer reinitialising after it sees the connection dropped (unexpectedly?)?

I ran truss, which didn't tell me too much without seeing the source, but there were a number of errors reported - EAGAIN & EINPROGRESS in roughly equal measure.
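A truss invocation along these lines captures the relevant calls (illustrative - the exact command used isn't recorded in the thread):

truss -f -t connect,close -p <httpd_pid>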

I noticed this post but it's impossible to determine if it's related -
http://forums.java.net/jive/thread.jspa?messageID=155501&#155501
