Posted by kcpeppe
on January 8, 2014 at 8:19 AM PST
Every once in a while I run into someone that has a JVM that is running back to back collections and yet the heap is still almost full after each attempt! When they discover that their problem is related to the JVM not having enough memory they often ask the question, why didn't the JVM throw an OutOfMemoryError?
Every once in a while I run into someone that has a JVM that is running back to back collections and yet the heap is still almost full after each attempt! When they discover that their problem is related to the JVM not having enough memory they often ask the question, why didn't the JVM throw an OutOfMemoryError? After all my application is not making any forward progress and the reason is Java heap is almost completely exhausted.
First things first, if you're lucky, you may never realize that your JVM is running a ponzi scheme on you. It’s telling your application that you have a infinite amount of memory so go ahead and just use it. The JVM’s garbage collectors are responsible for maintaining this illusion that all’s well in memory land. But in order to maintain this illusion the garbage collector has to magically freeze time using a “safe-pointing” spell so that the application doesn’t even realize what happened. However, as much as time may have stopped for your application, time hasn’t stopped for those using your application. If your application keeps a lot of live data the garbage collector are going to have to work much harder to maintain the illusion. While your application may not notice how often or for how long it’s been frozen for, your users will. But no matter because your application trusts the JVM and the JVM will work hard, heroically, even foolishly to maintain the illusion when it’s clear the gig is up leaving people wondering, why didn’t my app throw an OOME. Lets follow a GC log fragment to see if we can sort things out
85.578: [GC 85.578: [ParNew: 17024K->2110K(19136K), 0.0352412 secs] 113097K->106307K(126912K), 0.0353280 secs]
This is a normally completing parallel collection of Young most likely triggered by an allocation failure in Eden. The live bytes left behind total 106307K of which 2110K reside in Survivor. This implies the occupancy of Old is now at 104197K. However total heap is 126912K or which 19136K is Young. This implies that Tenured is 107776K. With a little more math we can see that Old is 104197/107777 or just under 97% occupied which is dangerously close to being completely full.
86.306: [CMS-concurrent-abortable-preclean: 0.744/1.546 secs]
86.306: [GC[YG occupancy: 10649 K (19136 K)]86.306: [Rescan (parallel) , 0.0039103 secs]86.310: [weak refs processing, 0.0005278 secs] [1 CMS-remark: 104196K(107776K)] 114846K(126912K), 0.0045393 secs]
86.366: [CMS-concurrent-sweep: 0.055/0.055 secs]
86.367: [CMS-concurrent-reset: 0.001/0.001 secs]
86.808: [GC [1 CMS-initial-mark: 101397K(107776K)] 119665K(126912K), 0.0156781 secs]
The next set of records show that the previous ParNew was running inside a CMS cycle and that cycle did manage to complete. Yet the next record following the end of the CMS cycle is another CMS cycle. The reason why is that the previous cycle recovered 104197K-101397K=2800K, approximately 2.5% tenured leaving GC’s game but one hiccup from being exposed.
86.875: [GC 86.875: [ParNew: 19134K->19134K(19136K), 0.0000167 secs]86.875: [CMS87.090: [CMS-concurrent-mark: 0.265/0.265 secs]
(concurrent mode failure): 101397K->107775K(107776K), 0.7588176 secs] 120531K->108870K(126912K), [CMS Perm : 15590K->15589K(28412K)], 0.7589328 secs]
The next records show that unfortunately for the collector, that one bad thing has just happened, the CMS concurrent-mark barely gets a chance to do it's thing and now that pesky ParNew wants to promote a lot of data into tenured and can’t as there isn’t enough space to do so. To make things worse, the ParNew’s attempts to get memory caused the CMS cycle to fail (concurrent mode failure) and rather than let the application know all is not well and to keep the game going, GC decides to call in the big gun, a Full GC. But even with this move, things aren’t looking good for the Full GC ends up leaving 107775K in Old and it’s only 107776K in size! That’s close to 100% full. However we still can move on as Young contains 1095K (108870K-107775K) which is all stuffed into Survivor. That was a close call but GC still has to scramble if it is to keep the ponzi scheme from being exposed.
87.734: [Full GC 87.734: [CMS: 107775K->107775K(107776K), 0.5109214 secs] 111054K->109938K(126912K), [CMS Perm : 15589K->15589K(28412K)], 0.5110117 secs]
Another Full GC to paper over the problem that the JVM has run out of memory. This one happens at 87.734 seconds after JVM startup. The previous pause time was 0.7589328. Add that to 86.875 and you get 87.634 meaning the application got just about 100ms of run time before it was once again interrupted. But once again the heroic effort buys the GC a wee bit more time before it attempts yet another CMS cycle, a ParNew that fails and reverts (silently in this case) to yet another Full GC all the while stringing along the application into believing all is well when it isn’t
88.246: [GC [1 CMS-initial-mark: 107775K(107776K)] 109938K(126912K), 0.0040875 secs]
88.640: [CMS-concurrent-mark: 0.390/0.390 secs]
88.844: [CMS-concurrent-preclean: 0.204/0.204 secs]
88.844: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs]
88.844: [GC[YG occupancy: 11380 K (19136 K)]88.844: [Rescan (parallel) , 0.0109385 secs]88.855: [weak refs processing, 0.0002293 secs] [1 CMS-remark: 107775K(107776K)] 119156K(126912K), 0.0112696 secs]
88.914: [CMS-concurrent-sweep: 0.059/0.059 secs]
88.915: [CMS-concurrent-reset: 0.001/0.001 secs]
89.260: [GC 89.260: [ParNew: 19135K->19135K(19136K), 0.0000156 secs]89.260: [CMS: 105875K->107775K(107776K), 0.5703972 secs] 125011K->116886K(126912K), [CMS Perm : 15589K->15584K(28412K)], 0.5705219 secs]
89.831: [GC [1 CMS-initial-mark: 107775K(107776K)] 117010K(126912K), 0.0090772 secs]
90.192: [CMS-concurrent-mark: 0.351/0.352 secs]
90.343: [Full GC 90.343: [CMS90.379: [CMS-concurrent-preclean: 0.187/0.187 secs]
(concurrent mode failure): 107775K->104076K(107776K), 0.5815666 secs] 126911K->104076K(126912K), [CMS Perm : 15586K->15585K(28412K)], 0.5816572 secs]
90.973: [GC [1 CMS-initial-mark: 104076K(107776K)] 104883K(126912K), 0.0025093 secs]
91.335: [CMS-concurrent-mark: 0.359/0.359 secs]
91.367: [CMS-concurrent-preclean: 0.031/0.032 secs]
92.136: [GC 92.136: [ParNew: 17024K->17024K(19136K), 0.0000167 secs]92.136: [CMS92.136: [CMS-concurrent-abortable-preclean: 0.054/0.769 secs]
(concurrent mode failure): 104076K->107775K(107776K), 0.5377208 secs] 121100K->110436K(126912K), [CMS Perm : 15588K->15586K(28412K)], 0.5378416 secs]
92.838: [GC [1 CMS-initial-mark: 107775K(107776K)] 112679K(126912K), 0.0050877 secs]
93.209: [CMS-concurrent-mark: 0.366/0.366 secs]
93.425: [CMS-concurrent-preclean: 0.215/0.215 secs]
93.425: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs]
93.425: [GC[YG occupancy: 13921 K (19136 K)]93.425: [Rescan (parallel) , 0.0130859 secs]93.438: [weak refs processing, 0.0002302 secs] [1 CMS-remark: 107775K(107776K)] 121697K(126912K), 0.0134232 secs]
93.505: [CMS-concurrent-sweep: 0.067/0.067 secs]
93.506: [CMS-concurrent-reset: 0.001/0.001 secs]
Yet more back to back CMS cycles with concurrent mode failures all indicating that the garbage collector is just able to keep things going but in all honesty you have to start wondering if the cost is too high and if at this point the better thing to do would be to throw some sort of warning or error.
Defining Out Of Memory
First and foremost, the elephant in the room is that Java heap is too small to support this application. A larger heap would help GC perpetuate the ponzi scheme indefinitely. While the application may not realize that all is not well, end users surely do. We’d really like the application to complain before the users do. Unfortunately we don’t have an AlmostOutOfMemoryError but we can redefine when an OutOfMemoryError is thrown but adjusting the parameters GCTimeLimit and GCHeapFreeLimit.
The default values for GCTimeLimit is 98% meaning that if 98% of recent time is spent in GC than an OutOfMemoryError will be thrown. GCHeapFreeLimit if the amount of free heap available after a collection. The default value for this variable is 2%. If we run the numbers from the GC log fragment we can see that GC does just well enough that it never exceeds either of these thresholds. Thus the GC is able to keep the ponzi scheme limping along. And since these values are some what arbitrary, one could redefine them to say; if you have to work so hard to maintain the illusion maybe it’s best to just throw in the towel and let the application know that it’s struggling for memory. In this case setting the GCHeapFreeLimit to 5% and or setting GCTimeLimit to 90% would have triggered an OutOfMemoryError. This would have been the biggest hint as to why this application was not responding well and would have helped the victims of the ponzi scheme know that they'd been had.