1 vote

We are experiencing problems with one of our apps in prod.

The VM is configured as follows:

-XX:MaxPermSize=300M -Xms2560M -Xmx2560M -Xloggc:/app/log/gc-admin-20120619-123754.log -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+DisableExplicitGC -XX:CMSMaxAbortablePrecleanTime=8000

Two options which I have missed and which will be applied are -XX:PermSize (should be set to the same value as MaxPermSize, as recommended) and -XX:+UseCMSInitiatingOccupancyOnly (needed when CMSInitiatingOccupancyFraction is used, otherwise the value you specify does not stick!).
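For reference, the full command line with those two options applied would look something like this (same values as before, purely illustrative):

-XX:PermSize=300M -XX:MaxPermSize=300M -Xms2560M -Xmx2560M -Xloggc:/app/log/gc-admin-20120619-123754.log -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:CMSMaxAbortablePrecleanTime=8000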

However, with these changes in the pipeline I am not so confident they will solve my problem.

I am seeing concurrent mode failures, but when a failure happens the stop-the-world collection takes an eternity. At the moment I am a bit bemused as to why.

Here are some samples:

168427.476: [GC [1 CMS-initial-mark: 2135988K(2578880K)] 2141041K(2617216K), 3.1029210 secs] [Times: user=0.02 sys=0.01, real=3.10 secs] 168430.596: [CMS-concurrent-mark-start] 168441.309: [GC 168441.309: [ParNew: 36520K->36520K(38336K), 0.0000210 secs]168441.309: [CMS168747.453: [CMS-concurrent-mark: 309.313/316.857 secs] [Times: user=5.75 sys=2.89, real=316.81 secs] (concurrent mode failure): 2561882K->1310927K(2578880K), 767.0309740 secs] 2598402K->1310927K(2617216K), [CMS Perm : 96774K->96171K(158792K)], 767.0379030 secs] [Times: user=3.87 sys=5.06, real=766.92 secs]

What worries me about the whole STW collection is the timing: 766.92 secs of real time but only "user=3.87 sys=5.06" of CPU time, so what is happening for the rest of the time? This is where I am puzzled; I can't imagine that stopping all the threads in the app takes that long. Thrashing maybe?

169545.325: [GC [1 CMS-initial-mark: 2141069K(2578880K)] 2166025K(2617216K), 0.0530140 secs] [Times: user=0.05 sys=0.00, real=0.06 secs] 169545.379: [CMS-concurrent-mark-start] 169558.635: [CMS-concurrent-mark: 10.407/13.256 secs] [Times: user=7.58 sys=0.53, real=13.25 secs] 169558.635: [CMS-concurrent-preclean-start] 169558.684: [CMS-concurrent-preclean: 0.048/0.048 secs] [Times: user=0.01 sys=0.00, real=0.05 secs] 169558.684: [CMS-concurrent-abortable-preclean-start] 169560.544: [GC 169560.544: [ParNew169560.605: [CMS-concurrent-abortable-preclean: 0.210/1.921 secs] [Times: user=0.93 sys=0.05, real=1.92 secs] 169560.846: [GC[YG occupancy: 1906 K (38336 K)]169560.846: [Rescan (parallel) , 0.0046910 secs]169560.851: [weak refs processing, 0.0000990 secs] [1 CMS-remark: 2350428K(2578880K)] 2352335K(2617216K), 0.0048570 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 169560.853: [CMS-concurrent-sweep-start] 169568.204: [CMS-concurrent-sweep: 7.351/7.351 secs] [Times: user=0.91 sys=0.09, real=7.34 secs] 169568.204: [CMS-concurrent-reset-start] 169568.211: [CMS-concurrent-reset: 0.007/0.007 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

This one shows no problems:

252247.318: [GC [1 CMS-initial-mark: 2069401K(2578880K)] 2075094K(2617216K), 1.5311840 secs] [Times: user=0.01 sys=0.00, real=1.53 secs] 252248.849: [CMS-concurrent-mark-start] 252350.336: [GC 252350.336: [ParNew: 20984K->4222K(38336K), 12.2251190 secs]252362.561: [CMS252520.780: [CMS-concurrent-mark: 161.376/271.922 secs] [Times: user=12.56 sys=1.72, real=271.89 secs] (concurrent mode failure): 2232372K->1061586K(2578880K), 407.2310250 secs] 2240205K->1061586K(2617216K), [CMS Perm : 97525K->97381K(160480K)], 419.4586450 secs] [Times: user=4.23 sys=2.99, real=419.39 secs]

And then another whopper: "Times: user=4.23 sys=2.99, real=419.39 secs". The CPU time is minor ("user=4.23 sys=2.99") but the overall time is 419.39 secs. What could cause the VM to hang for so long? A 2.5 GB heap should ideally be collected in a STW collection in under 10 secs.

I am going to lower the CMSInitiatingOccupancyFraction threshold, but with collection times like that I don't think it is going to help. Some collections run smoothly and some don't, but like I said, it's the timing that has me worried when a full stop-the-world collection happens.

I have read https://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a

And we are using JDK 6.

Anybody ever experience something similar before?

If you've read that blog, have you tried the suggestions that it makes? – Stephen C
What makes you think "419.39" in your last example is the time of a STW pause? Was your application really experiencing that pause? – Alexey Ragozin
stackoverflow.com/questions/556405/… . Hard to say how long a "user pause" is; what is easy to say is that it shouldn't be that high for a 2.5g app! @StephenC Yes I have and will, but I wasn't convinced those changes would drastically drop the times down for me. Hence the post. – JBOnThePc
Well you have your answers now. And my answer is (basically) try the things that the blog says :-) – Stephen C

3 Answers

2 votes

As you observed, when concurrent mode fails, the collector falls back to a stop-the-world collection. My understanding is that this may be done using a mark-sweep-compact collector rather than the more efficient copying collector.

This doesn't entirely explain why the collection takes so long. However, VM thrashing is a plausible theory and your evidence supports this ... but you'd need to get some OS-level measurements of VM swap/paging rates to be sure. (If a JVM is going to cause thrashing, it is most likely to be worst during a full garbage collection when the heap is full.)
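For example, on Linux you could watch paging activity while the application is running (assuming vmstat is available on the box); sustained non-zero values in the si/so (swap-in/swap-out) columns during one of these long pauses would point at thrashing:

vmstat 5

Each line printed every 5 seconds includes the si and so columns, i.e. pages swapped in and out per second.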

Getting back to what caused the concurrent mode failure, the blog that you linked to says what is most likely happening:

  • Your heap is too full, or
  • The object allocation rate is too high, or
  • The object allocation rate is too variable, or
  • Some combination of the above.

The solutions suggested are:

  • Increase the heap size.
  • Reduce the CMSInitiatingOccupancyFraction value
  • Increase the CMSIncrementalSafetyFactor value

Another one is to switch to the Throughput collector, and suffer the occasional "longish" pauses when it does a full collection.
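For illustration, and assuming a stock JDK 6 HotSpot VM, switching to the throughput collector would mean replacing the CMS flags with something like the following, keeping the same heap sizes:

-Xms2560M -Xmx2560M -XX:+UseParallelGC -XX:+UseParallelOldGC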

If the problem is VM thrashing, then you are between a rock and a hard place. You've over-allocated your virtual memory relative to the amount of physical RAM available to the machine or virtual machine. Your options are either to give the machine/VM more RAM, or to reduce the system's virtual memory usage by reducing heap sizes, stopping services and applications, and so on.
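A quick sanity check (again assuming a Linux box) is to compare physical RAM and swap usage against the sum of the Java heap, the permanent generation, and everything else running on the machine:

free -m

If swap is heavily used by the time the 2.5 GB heap is fully populated, that supports the thrashing theory.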

(Note that virtual memory thrashing can occur whether or not you are using virtualization. It's just that with virtualization the temptation to over-allocate memory is stronger ...)

1 vote

Is your application running on a virtual machine?

An explanation could be that your host is overloaded or swapping, which prevents the guest VM from doing any work while leaving it unaware that anything has happened.

0 votes

The permanent generation (PermSize) is used to hold reflective data of the VM itself, such as class objects and method objects. These reflective objects are allocated directly into the permanent generation, and it is sized independently of the other generations. Generally, the sizing of this generation can be ignored because the default size is adequate. However, programs that load many classes may need a larger permanent generation.

By default, MaxPermSize will be 32 MB for -client and 64 MB for -server. However, if you do not set both PermSize and MaxPermSize, the overall heap will not grow unless it is needed. When you set both PermSize and MaxPermSize to, for example, 192 MB, the extra heap space is allocated at startup and stays allocated.

Try adjusting both VM parameters; it might solve your problem:

-XX:PermSize=300m -XX:MaxPermSize=300m