
Long GC Pauses on the Young Generation (Scavenge)

vbatista
Joined: 2004-10-04

vbatista
Joined: 2004-10-04

Hi Brian
Thanks for your reply and tips. I have changed my startup options so that the minimum and maximum memory areas have the same size:
-XX:+UseConcMarkSweepGC
-XX:+DisableExplicitGC
-XX:PermSize=64m
-XX:MaxPermSize=64m
-Xms512m
-Xmx512m
-Xss256k
-Xincgc
-XX:NewSize=200m
-XX:MaxNewSize=200m
-XX:+AlwaysPreTouch

With this configuration my problems persisted.

I am using Mina, which uses DirectBuffers. Following your tip about NIO, I searched the full list of Java 6 options and found MaxDirectMemorySize, so I added:
-XX:MaxDirectMemorySize=128m

As if by a miracle, the long pauses stopped. My RTSP proxy has now been running for 12 hours with more than 200 clients connected, and my maximum pause has been 0.4 seconds!
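
For anyone who wants to see the limit in isolation, here is a minimal hypothetical sketch (class name and the 1 MB buffer size are arbitrary; this is not my Mina code) that allocates direct buffers until the cap is hit:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with: java -XX:MaxDirectMemorySize=128m DirectLimitTest
public class DirectLimitTest {
    public static void main(String[] args) {
        List<ByteBuffer> held = new ArrayList<ByteBuffer>();
        int count = 0;
        try {
            while (true) {
                // Each direct buffer takes 1 MB outside the Java heap.
                held.add(ByteBuffer.allocateDirect(1024 * 1024));
                count++;
            }
        } catch (OutOfMemoryError e) {
            // With a 128m cap, this should trigger at roughly 128 buffers.
            System.out.println("Hit direct memory limit after " + count + " MB: " + e);
        }
    }
}

Run with -XX:MaxDirectMemorySize=128m, it should fail after roughly 128 allocations with an OutOfMemoryError for direct buffer memory.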

However, I still don't understand two things:
1) By adding this property, I am limiting the maximum memory for direct buffers. If my application works fine with this configuration, why wasn't this memory being cleaned up correctly when the property was not set?
2) Looking at my startup options, I have 512 MB + 64 MB + 128 MB = 704 MB. Tracing memory consumption, I observed that my application uses 1.5 GB of memory, about double the maximum I configured with the startup options. I know the JVM has some overhead, but in this case the OS is using about twice the configured memory.

Does anyone have any hints on these questions?

Thanks in advance.
Best regards,
Victor Batista

briand
Joined: 2005-07-11

> 1) By adding this property, I am limiting the maximum memory for direct buffers. If my application
> works fine with this configuration, why wasn't this memory being cleaned up correctly when the property was not set?

That's a reasonable question, but unfortunately I don't have an answer
for you. It could be that your third-party class library is catching the OutOfMemoryError
thrown when allocating direct ByteBuffer objects and using that as the upper bound
for its cache of these objects. By setting the limit, you change the size
of this cache. That's just conjecture, though.
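
Purely to illustrate that conjecture, a hypothetical cache along those lines (invented names; not Mina's actual code) might look something like this:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: the library grows its direct-buffer cache
// until allocation fails, and treats that failure as the cache's upper bound.
// (Assumes all callers request the same buffer size.)
public class ConjecturedBufferCache {
    private final List<ByteBuffer> cache = new ArrayList<ByteBuffer>();
    private boolean directLimitReached = false;

    public synchronized ByteBuffer acquire(int size) {
        if (!cache.isEmpty()) {
            ByteBuffer buf = cache.remove(cache.size() - 1);
            buf.clear();
            return buf;
        }
        if (!directLimitReached) {
            try {
                return ByteBuffer.allocateDirect(size);
            } catch (OutOfMemoryError e) {
                // The OutOfMemoryError itself becomes the signal that the cap
                // (e.g. -XX:MaxDirectMemorySize) has been hit.
                directLimitReached = true;
            }
        }
        return ByteBuffer.allocate(size);   // fall back to a heap buffer
    }

    public synchronized void release(ByteBuffer buf) {
        if (buf.isDirect()) {
            cache.add(buf);                 // keep direct buffers for reuse
        }
    }
}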

> 2) Looking at my startup options, I have 512 MB + 64 MB + 128 MB = 704 MB. Tracing memory consumption, I
> observed that my application uses 1.5 GB of memory, about double the maximum I configured with the
> startup options. I know the JVM has some overhead, but in this case the OS is using about twice the
> configured memory.

The JVM has text and data of its own. One component of that is the code cache,
which is where the JVM stores the executable native code for JIT'ed methods.
For 32-bit Linux, this typically starts at about 2m and grows up to 48m. Also, don't
forget the 128k * number of threads and any memory used for mapped files
(including the central index for each jar file). Furthermore, C-heap allocation
might also occur from native code. All of this adds up. Whether it makes sense for
this to add up to 1.5GB when the Java heap is only 512m depends on the
application.

The best way to get a feeling for where the process virtual memory is going is
to look at the /proc/<pid>/maps file (or run the pmap command, if your distro
includes it). This will give you a map of the process address space. Some
things are not readily identifiable from this address map, but it's usually
quite revealing.
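
If you want to do the same thing from within Java, a small sketch (assuming Linux, reading the process's own /proc/self/maps) that totals the mapped address space might look like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.math.BigInteger;

// Prints each mapping in this JVM's /proc/self/maps and totals the address space.
public class MapsDump {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("/proc/self/maps"));
        BigInteger total = BigInteger.ZERO;
        String line;
        while ((line = in.readLine()) != null) {
            // Each line starts with "start-end", both hex addresses.
            String[] range = line.split("\\s+")[0].split("-");
            BigInteger size = new BigInteger(range[1], 16).subtract(new BigInteger(range[0], 16));
            total = total.add(size);
            System.out.println(size.shiftRight(10) + " KB  " + line);
        }
        in.close();
        System.out.println("Total mapped: " + total.shiftRight(20) + " MB");
    }
}

Each mapping's size is just end minus start, so the sum is the total virtual address space the process is holding, the same number pmap reports.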

briand
Joined: 2005-07-11

There are a number of potential causes for high variance in GC times. Paging
is one particularly bad one and really should be avoided at all costs. Small amounts
of paging might not be too costly, but any significant paging activity will kill your
GC performance (and quite possibly other aspects of performance as well).

You mention that you have set -Xmx764m, -XX:MaxPermSize=64m, -Xss1m
and that you have about 20 threads. That accounts for about 848m of virtual
address space. The JVM has other memory resources as well, but really shouldn't
account for all of the difference. You mention that you are using NIO - are you
using direct or mapped byte buffers? If so, those will consume address space
and that space will not be freed until the objects are no longer reachable
and the reference processing threads have run to release them. Furthermore,
if you create a lot of these resources in a short period of time, it's easy to consume
a lot of memory, and if you create them regularly, a large amount of address space
can end up consumed by these resources. NIO direct and mapped byte buffers are
one of the few cases where pooling is advised - create a fixed pool of these things
and reuse them where possible.
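
As a rough sketch of that pooling idea (hypothetical class, one fixed buffer size), something like the following is usually enough:

import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of a fixed pool of equally sized direct buffers, created
// up front and reused. Callers block if every buffer is checked out.
public class DirectBufferPool {
    private final BlockingQueue<ByteBuffer> pool;

    public DirectBufferPool(int buffers, int bufferSize) {
        pool = new ArrayBlockingQueue<ByteBuffer>(buffers);
        for (int i = 0; i < buffers; i++) {
            pool.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    public ByteBuffer acquire() throws InterruptedException {
        ByteBuffer buf = pool.take();   // blocks until a buffer is free
        buf.clear();
        return buf;
    }

    public void release(ByteBuffer buf) {
        pool.offer(buf);                // return the buffer for reuse
    }
}

Allocating the buffers once at startup keeps the direct-memory footprint bounded and avoids the allocation and reclamation churn described above.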

There are other possible causes of high GC time variance. I'll try to summarize
the ones I'm aware of here:

* -Xms != -Xmx (and similarly -XX:PermSize != -XX:MaxPermSize). When these
are not set to the same value, it instructs the JVM to commit and release real
memory (RES, as you describe it) as needed. This allows the Java process to be a
good OS citizen. However, if you are pause-time sensitive, it's important to know
that committing pages has a cost, and that cost can be variable under certain
circumstances - particularly when memory resources are low. This cost can be
compounded if you are using large pages and the OS attempts page coalescing
to get you a large page. For server applications with pause time constraints, it's
generally best to set -Xms == -Xmx.

Since you don't have -Xms == -Xmx, this could be causing some of your issues,
as finding pages of real memory during times of memory shortfalls is likely to be
expensive. Making -Xms == -Xmx might help with the pause times, but it will also
likely add more pressure to the virtual memory system. More RAM, or reducing
memory pressure, is a likely solution here.

* CMS Old Gen fragmentation. As the old generation gets fragmented, promotions
from the young generation become more costly, as free lists have to be searched
to find space for the promotions. This is a difficult one to see, but the best way to
prevent it is to prevent premature promotions into the old gen by sizing the young
gen and the survivor spaces large enough that short-lived objects die in the young gen.

Note that since your pause times are showing up as system time in minor collections,
this is not likely the problem you are seeing here.

* Paging artifacts - There are unusual conditions in some OS's that cause some
JVM implementation specifics to encounter page thrashing, which results in longer
pause times. Typically, though, these pause time artifacts don't show up as GC time -
instead they tend to show up as pauses that are longer than the reported GC times.
The typical workaround for this issue is -XX:+UseMembar, which it appears
you have tried. To determine whether you really need UseMembar, you should add
-XX:+PrintGCApplicationConcurrentTime and -XX:+PrintGCApplicationStoppedTime
along with -verbosegc -XX:+PrintGCTimeStamps (see the flag list after this summary).
This will include timing information about the amount of time between safepoints
and the amount of time in a safepoint. If the elapsed time between the start of a
safepoint (the time stamp of the PrintGCApplicationConcurrentTime message) and
the end of the safepoint (the time stamp of the PrintGCApplicationStoppedTime
message) is longer than the reported GC time (for GC safepoints), then it's likely
that +UseMembar will help. In a recent release (1.5.0_13 and 1.6.0_04, I believe),
a change to the safepoint code was introduced that should eliminate the need
for -XX:+UseMembar.

Note that since your pause times show up in the GC time (and in particular in
the system time metric), the +UseMembar workaround for 6546278 is not applicable
to this particular situation. That's not to say that +UseMembar isn't avoiding 6546278
for your application, though - it may well be.

* First page touch costs - The first time the JVM touches a page of the heap, there's
a page zeroing cost that must be paid. This zeroing is a function of the OS, not the
JVM. Depending on the page size, this zeroing operation can be observable in
GC pause times (particularly during copy and promotion operations). In 1.6.0, a
new option was added, called -XX:+AlwaysPreTouch, that shifts this first-touch cost
to JVM startup time instead of copy or promotion time. This issue was typically
observed during the first few minor GC events, as objects were copied to the survivor
spaces, but it also occurs on first touches in the eden, old gen, and perm gen
spaces.

Note that you don't seem to be using large pages, so this may not be relevant to
your situation.
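
For reference, the diagnostic flags mentioned in the paging-artifacts item above can be combined roughly like this (flag names as spelled in recent 1.5/1.6 releases; adjust for your environment):

-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime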

HTH
Brian