
HTTP listeners stop responding

15 replies
bernhardhaeussermann

Hi,

We are using Glassfish Enterprise Server v2.1 for our SOAP web services.
Every once in a while some or all of the HTTP listeners stop responding (the admin listener on port 4848 never seems to be affected). This causes the web service clients to receive timeout errors.

When I check Glassfish's logs, I find the following error:

HTTP transport error: java.net.SocketException: Too many open files

Restarting Glassfish fixes this issue.

Any ideas as to what might cause this behaviour and how it could be fixed?

Kristian Rink

[...]
> causes the web service clients to receive time out errors. When I check
> Glassfish's logs, I find the following error. HTTP transport error:
> java.net.SocketException: Too many open files Restarting Glassfish fixes
> this issue. Any ideas as to what might cause this behaviour and how it
> could be fixed?

I used to run into the same error message ("too many open files"), which in our case was indeed caused by a large number of files (jar dependencies, local documents, scripts, ...) opened by our app back then. Increasing the file limits on the local system reliably cured it. Have a look at

http://www.java.net/forum/topic/glassfish/glassfish/too-many-open-files
http://lj4newbies.blogspot.de/2007/04/too-many-open-files.html ,

maybe this fixes your issues too?
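For reference, a quick sketch of how to check and raise the descriptor limit on Linux (the user name and numbers below are only placeholders, not values from this thread):

# show the current soft limit on open file descriptors for this shell
ulimit -n

# raise it for processes started from this shell
ulimit -n 8192

# or make it permanent for the user running Glassfish, via /etc/security/limits.conf
glassfish  soft  nofile  8192
glassfish  hard  nofile  16384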

Cheers,
Kristian

bernhardhaeussermann

Hi Kristian,

I have increased the file descriptor and handle limits as described in that blog.
So far we haven't experienced the issue again, so it looks as though that might have solved the problem. I don't know what opens up all those sockets, though.
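For what it's worth, one way to confirm that the new limit has actually taken effect for the running Glassfish process (with <pid> standing in for its process id) is:

cat /proc/<pid>/limits | grep "open files"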

bernhardhaeussermann

OK. We had Glassfish freeze again. It is possible that increasing the file descriptor and handle limits helped to an extent, but ultimately it only postponed the issue.

When Glassfish's listeners froze today, I ran netstat before restarting Glassfish and again after restarting it. I have attached the outputs to this message.

I note that the netstat list taken before restarting Glassfish is considerably longer than the one taken afterwards. A large portion of the connections are to the machine's own address (Excel makes it easy to analyse the file), from ports 40060 up to 58614, and several are to port "webcache".

It looks like Glassfish (or one of the applications deployed on it) opens sockets over time without closing them. I cannot see how one of our web services could cause this. I hope that the netstat output provides clues to what might be going on.
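One crude way to watch whether the descriptor count really grows over time (just a sketch; <pid> is a placeholder for the Glassfish process id) would be a loop like:

# log the open-descriptor count every 5 minutes
while true; do
    echo "$(date '+%F %T') $(lsof -p <pid> | wc -l)" >> fd-count.log
    sleep 300
done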

oleksiys

AFAIK "webcache" is port 8080, normally used by the Glassfish HTTP listener.
From the logs you sent I see only ~650 open TCP connections before the GF restart. That is not a very big number and IMO should not cause any problems.

So maybe this file descriptor leak is not directly related to network connections. Try to check the file descriptor stats using some utility (like lsof). For example you can check the open file descriptors by PID: $lsof | grep
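For instance, something like this would count the descriptors held by the Glassfish process (<pid> below is just a placeholder for its process id):

lsof -p <pid> | wc -l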

WBR,
Alexey.

bernhardhaeussermann

Hi oleksiys,

I retrieved the lsof output for the java process (attached).
I looked for multiple handles to the same file, but that does not seem to be the case.

oleksiys

Hi,

Is this the lsof report taken when the server stops responding?
From the report I don't see many files (including network connections) open; maybe there is another process responsible for the file descriptor leak.
Can you please run lsof and get the number of file descriptors open by each process?
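A rough one-liner for that would be something like the following (the second lsof column is the PID, so this lists the processes holding the most descriptors):

lsof | awk '{print $2}' | sort | uniq -c | sort -rn | head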

Thanks.

bernhardhaeussermann

At this stage it seems like the java process has the largest number of file descriptors.
I ran lsof during normal operation. I will check again when the problem returns.

Roel_D

How big is your thread pool?
What is the thread timeout?
There are also settings like "associate with thread".

I use Liferay on GF3, and Liferay has a huge number of jars and accompanying files. But open files top out at about 1800 on my machines.

Kind regards,

The out-side

On 7 Dec 2012, at 11:23, forums@java.net wrote:

> At this stage it seems like the java process has the largest amount of file
> descriptors. I ran lsof during normal operation. I will check again when the
> problem returns.
>
> --
>
> [Message sent by forum member 'bernhardhaeussermann']
>
> View Post: http://forums.java.net/node/892788
>


bernhardhaeussermann

Both HTTP listeners (one on 8080 and one on 443) have 3 acceptor threads.

The thread pool's maximum size is set to 200 and the idle timeout is 120 seconds, which I think is reasonable. The number of work queues is 1.

oleksiys

Hi,

Do your web services open any client connections?
You may want to use some utility (like netstat) to find out what kind of network connections (if it is really related to network connections) cause this problem.
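For example, something like this (a sketch; flag names can differ slightly between systems) summarizes the open connections by TCP state and remote address:

# column 6 is the state, column 5 the remote address; the first two lines are headers
netstat -tan | awk 'NR>2 {print $6, $5}' | sort | uniq -c | sort -rn | head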

WBR,
Alexey.

bernhardhaeussermann

Hi Alexey,

When I run netstat I see a very long list of connections, and a lot of them are to the remote database server (DB2) in the TIME_WAIT state.
Interestingly, though, when I list all open database connections in DB2, I always see only 10 to 15 connections at a time. The number of connections to the server as reported by netstat on the Glassfish server is much larger.

Can you offer a possible explanation for this phenomenon?
Do you think that this might be the key to the problem?

oleksiys

Hi,

I'd trust netstat.
The TIME_WAIT state means the connection is in a semi-closed state (more info here [1]). In general you can reduce the time a socket waits in TIME_WAIT by changing the socket linger timeout, but that may cause other problems.
Maybe it's just some DB2 JDBC driver issue; try changing the driver.
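As a side note, a related OS-level knob on Linux (an assumption on my part, not something discussed above) is to allow TIME_WAIT sockets to be reused for new outgoing connections; like any such tuning it can have side effects of its own:

# show the current setting, then enable reuse of TIME_WAIT sockets for outgoing connections
sysctl net.ipv4.tcp_tw_reuse
sysctl -w net.ipv4.tcp_tw_reuse=1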

WBR,
Alexey.

[1] http://en.wikipedia.org/wiki/Transmission_Control_Protocol

bernhardhaeussermann

This issue occurred again today.

When I ran netstat, using grep to show only the connections to the database server (via the command netstat | grep "50-16-"), I got:

tcp        1      0 domU-12-31-39-06-25-C:47149 ec2-50-16-218-250.com:50001 CLOSE_WAIT
tcp        1      0 domU-12-31-39-06-25-C:47613 ec2-50-16-218-250.com:50001 CLOSE_WAIT
tcp        1      0 domU-12-31-39-06-25-C:47854 ec2-50-16-218-250.com:50001 CLOSE_WAIT
tcp        1      0 domU-12-31-39-06-25-C:47861 ec2-50-16-218-250.com:50001 CLOSE_WAIT
tcp        1      0 domU-12-31-39-06-25-C:48039 ec2-50-16-218-250.com:50001 CLOSE_WAIT
tcp        1      0 domU-12-31-39-06-25-C:48057 ec2-50-16-218-250.com:50001 CLOSE_WAIT
tcp        1      0 domU-12-31-39-06-25-C:40936 ec2-50-16-218-250.com:50001 CLOSE_WAIT
tcp        0      0 domU-12-31-39-06-25-C:51361 ec2-50-16-218-250.com:50001 TIME_WAIT
tcp      650      0 domU-12-31-39-06-25-C:https ec2-50-16-218-250.com:59268 ESTABLISHED
tcp      651      0 domU-12-31-39-06-25-C:https ec2-50-16-218-250.com:59260 CLOSE_WAIT

That is, there were not many connections open to the billing service (but there was a very long list of connections to unfamiliar addresses). Similarly, the list of connections as reported by DB2 contained only a few entries for the Glassfish server.
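As an aside, CLOSE_WAIT means the remote side has already closed the connection but the local application has not yet closed its end, so a rough sketch like this shows which remote hosts those half-closed sockets point at:

netstat -tan | awk '$6 == "CLOSE_WAIT" {print $5}' | sort | uniq -c | sort -rn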

On DB2 I forced off all the connections from Glassfish and ran netstat | grep "50-16-" again, receiving the following output:

tcp        0    425 domU-12-31-39-06-25-C:47149 ec2-50-16-218-250.com:50001 LAST_ACK
tcp        1      1 domU-12-31-39-06-25-C:47613 ec2-50-16-218-250.com:50001 LAST_ACK
tcp        1      1 domU-12-31-39-06-25-C:47854 ec2-50-16-218-250.com:50001 LAST_ACK
tcp        1      1 domU-12-31-39-06-25-C:47861 ec2-50-16-218-250.com:50001 LAST_ACK
tcp        1      1 domU-12-31-39-06-25-C:48039 ec2-50-16-218-250.com:50001 LAST_ACK
tcp        1      1 domU-12-31-39-06-25-C:48057 ec2-50-16-218-250.com:50001 LAST_ACK
tcp        1      1 domU-12-31-39-06-25-C:40936 ec2-50-16-218-250.com:50001 LAST_ACK

Glassfish was still unresponsive.

Finally, I restarted Glassfish and everything returned to normal. netstat reported a much shorter list of connections.

Any ideas?

bernhardhaeussermann

Thanks! I will keep an eye on it.