We are experiencing problems with one of our apps in prod.
VM is configured as follows
-XX:MaxPermSize=300M -Xms2560M -Xmx2560M -Xloggc:/app/log/gc-admin-20120619-123754.log -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+DisableExplicitGC -XX:CMSMaxAbortablePrecleanTime=8000
Two options which i have missed and will be applied are XX:PermSize - Should be the same as MaxPermSize (recommended) UseCMSInitiatingOccupancyOnly when CMSInitiatingOccupancyFraction is used, otherwise the value you specify does not stick!!
However with these changes in the pipleline im not so confident it will solve my problem.
Im seeing concurrent mode failures but when a failure happens, the stop the world collection takes an eternity. At the moment im a bit bemused to as to why!!
Here are some samples
168427.476: [GC [1 CMS-initial-mark: 2135988K(2578880K)] 2141041K(2617216K), 3.1029210 secs] [Times: user=0.02 sys=0.01, real=3.10 secs] 168430.596: [CMS-concurrent-mark-start] 168441.309: [GC 168441.309: [ParNew: 36520K->36520K(38336K), 0.0000210 secs]168441.309: [CMS168747.453: [CMS-concurrent-mark: 309.313/316.857 secs] [Times: user=5.75 sys=2.89, real=316.81 secs] (concurrent mode failure): 2561882K->1310927K(2578880K), 767.0309740 secs] 2598402K->1310927K(2617216K), [CMS Perm : 96774K->96171K(158792K)], 767.0379030 secs] [Times: user=3.87 sys=5.06, real=766.92 secs]
What worries me about the whole STW collection is the timing 766.92secs but only "user=3.87 sys=5.06" of CPU time so what is happening here for the rest of the time?? This is where im puzzled, i cant imagine that stopping all threads in the app takes that long!! Thrashing maybe??
169545.325: [GC [1 CMS-initial-mark: 2141069K(2578880K)] 2166025K(2617216K), 0.0530140 secs] [Times: user=0.05 sys=0.00, real=0.06 secs] 169545.379: [CMS-concurrent-mark-start] 169558.635: [CMS-concurrent-mark: 10.407/13.256 secs] [Times: user=7.58 sys=0.53, real=13.25 secs] 169558.635: [CMS-concurrent-preclean-start] 169558.684: [CMS-concurrent-preclean: 0.048/0.048 secs] [Times: user=0.01 sys=0.00, real=0.05 secs] 169558.684: [CMS-concurrent-abortable-preclean-start] 169560.544: [GC 169560.544: [ParNew169560.605: [CMS-concurrent-abortable-preclean: 0.210/1.921 secs] [Times: user=0.93 sys=0.05, real=1.92 secs] 169560.846: [GC[YG occupancy: 1906 K (38336 K)]169560.846: [Rescan (parallel) , 0.0046910 secs]169560.851: [weak refs processing, 0.0000990 secs] [1 CMS-remark: 2350428K(2578880K)] 2352335K(2617216K), 0.0048570 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 169560.853: [CMS-concurrent-sweep-start] 169568.204: [CMS-concurrent-sweep: 7.351/7.351 secs] [Times: user=0.91 sys=0.09, real=7.34 secs] 169568.204: [CMS-concurrent-reset-start] 169568.211: [CMS-concurrent-reset: 0.007/0.007 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
This one shows no problems
252247.318: [GC [1 CMS-initial-mark: 2069401K(2578880K)] 2075094K(2617216K), 1.5311840 secs] [Times: user=0.01 sys=0.00, real=1.53 secs] 252248.849: [CMS-concurrent-mark-start] 252350.336: [GC 252350.336: [ParNew: 20984K->4222K(38336K), 12.2251190 secs]252362.561: [CMS252520.780: [CMS-concurrent-mark: 161.376/271.922 secs] [Times: user=12.56 sys=1.72, real=271.89 secs] (concurrent mode failure): 2232372K->1061586K(2578880K), 407.2310250 secs] 2240205K->1061586K(2617216K), [CMS Perm : 97525K->97381K(160480K)], 419.4586450 secs] [Times: user=4.23 sys=2.99, real=419.39 secs]
And then another whopper "Times: user=4.23 sys=2.99, real=419.39 secs". CPU time is minor "user=4.23 sys=2.99" but overall time is "419.39". What could cause the VM to hang for so long?? 2.5g should ideally be collected in a STW collection under 10secs!!
Im going to lower the threshold CMSInitiatingOccupancyFraction but i dont think with a collections time like that it is going to help!! Some collections run smooth and some dont but like i said, its the timing that has me worried when a full stop the world happens.
I have read https://blogs.oracle.com/jonthecollector/entry/what_the_heck_s_a
And we are using jdk6.
Anybody ever experience something similar before?