0 votes

We have built a webapp that runs in JBoss 4.2.3 on Red Hat Linux 5.2 on a 12-core IBM machine. Lately we have seen long response times, which seem to be caused by GC activity. The pattern is roughly as follows: requests keep arriving at the normal rate, but no bytes are sent back until suddenly all the pending responses are "flushed" at once. At the same time, gc.log shows this pattern:

// Here the unresponsiveness starts
36788.309: [GC [1 CMS-initial-mark: 1919297K(4456448K)] 1969797K(5111808K), 0.1473460 secs] [Times: user=0.15 sys=0.00, real=0.15 secs]
Total time for which application threads were stopped: 0.1530790 seconds
36788.457: [CMS-concurrent-mark-start]
36793.886: [CMS-concurrent-mark: 5.429/5.429 secs] (CMS-concurrent-mark yielded 46 times)
 [Times: user=6.69 sys=0.07, real=5.43 secs]
36793.886: [CMS-concurrent-preclean-start]
 (cardTable: 15956 cards, re-scanned 15956 cards, 1 iterations)
36793.988: [CMS-concurrent-preclean: 0.089/0.101 secs] (CMS-concurrent-preclean yielded 9 times)
 [Times: user=0.12 sys=0.01, real=0.10 secs]
36793.988: [CMS-concurrent-abortable-preclean-start]
 (cardTable: 75 cards, re-scanned 75 cards, 1 iterations)
Application time: 5.6923460 seconds
Total time for which application threads were stopped: 0.0170000 seconds
 (cardTable: 96 cards, re-scanned 96 cards, 1 iterations)
 (cardTable: 207 cards, re-scanned 207 cards, 1 iterations)
 (cardTable: 45 cards, re-scanned 45 cards, 1 iterations)
 (cardTable: 157 cards, re-scanned 157 cards, 1 iterations)
 (cardTable: 88 cards, re-scanned 88 cards, 1 iterations)
 (cardTable: 213 cards, re-scanned 213 cards, 1 iterations)
 (cardTable: 84 cards, re-scanned 84 cards, 1 iterations)
 (cardTable: 143 cards, re-scanned 143 cards, 1 iterations)
 (cardTable: 40 cards, re-scanned 40 cards, 1 iterations)
 (cardTable: 109 cards, re-scanned 109 cards, 1 iterations)
 (cardTable: 77 cards, re-scanned 77 cards, 1 iterations)
 (cardTable: 156 cards, re-scanned 156 cards, 1 iterations)
 (cardTable: 132 cards, re-scanned 132 cards, 1 iterations)
 (cardTable: 119 cards, re-scanned 119 cards, 1 iterations)
 (cardTable: 115 cards, re-scanned 115 cards, 1 iterations)
 (cardTable: 195 cards, re-scanned 195 cards, 1 iterations)
 (cardTable: 163 cards, re-scanned 163 cards, 1 iterations)
 (cardTable: 166 cards, re-scanned 166 cards, 1 iterations)
 (cardTable: 149 cards, re-scanned 149 cards, 1 iterations)
 (cardTable: 218 cards, re-scanned 218 cards, 1 iterations)
 (cardTable: 46 cards, re-scanned 46 cards, 1 iterations)
 (cardTable: 57 cards, re-scanned 57 cards, 1 iterations)
 (cardTable: 213 cards, re-scanned 213 cards, 1 iterations)
 (cardTable: 189 cards, re-scanned 189 cards, 1 iterations)
 (cardTable: 104 cards, re-scanned 104 cards, 1 iterations)
 (cardTable: 60 cards, re-scanned 60 cards, 1 iterations)
 (cardTable: 368 cards, re-scanned 368 cards, 1 iterations)
 (cardTable: 118 cards, re-scanned 118 cards, 1 iterations)
 (cardTable: 129 cards, re-scanned 129 cards, 1 iterations)
 (cardTable: 291 cards, re-scanned 291 cards, 1 iterations)
 (cardTable: 105 cards, re-scanned 105 cards, 1 iterations)
 (cardTable: 63 cards, re-scanned 63 cards, 1 iterations)
 (cardTable: 73 cards, re-scanned 73 cards, 1 iterations)
 (cardTable: 165 cards, re-scanned 165 cards, 1 iterations)
 (cardTable: 69 cards, re-scanned 69 cards, 1 iterations)
 (cardTable: 157 cards, re-scanned 157 cards, 1 iterations)
 CMS: abort preclean due to time  [37 iterations, 13 waits, 4954 cards)] 36799.161: [CMS-concurrent-abortable-preclean: 2.965/5.173 secs] (CMS-concurrent-abortable-preclean yielded 166 times)
 [Times: user=5.12 sys=0.10, real=5.17 secs]
Application time: 4.9945030 seconds
36799.172: [GC[YG occupancy: 572074 K (655360 K)]36799.172: [GC 36799.173: [ParNew
Desired survivor size 67108864 bytes, new threshold 2 (max 15)
- age   1:   59654872 bytes,   59654872 total
- age   2:   26827232 bytes,   86482104 total
- age   3:    2613360 bytes,   89095464 total
- age   4:     929600 bytes,   90025064 total
- age   5:     893176 bytes,   90918240 total
- age   6:     795736 bytes,   91713976 total
- age   7:     809240 bytes,   92523216 total
- age   8:     936216 bytes,   93459432 total
- age   9:     971376 bytes,   94430808 total
- age  10:     624472 bytes,   95055280 total
- age  11:     883616 bytes,   95938896 total
- age  12:     817296 bytes,   96756192 total
- age  13:     977992 bytes,   97734184 total
- age  14:     733816 bytes,   98468000 total
- age  15:     654744 bytes,   99122744 total
: 572074K->110988K(655360K), 0.1855140 secs] 2491372K->2031059K(5111808K) icms_dc=0 , 0.1866380 secs] [Times: user=0.72 sys=0.00, real=0.19 secs]
36799.359: [Rescan (parallel)  (Survivor:12chunks) Finished young gen rescan work in 3th thread: 0.032 sec
Finished young gen rescan work in 2th thread: 0.035 sec
Finished young gen rescan work in 0th thread: 0.035 sec
Finished young gen rescan work in 1th thread: 0.035 sec
Finished remaining root rescan work in 1th thread: 0.030 sec
Finished remaining root rescan work in 3th thread: 0.033 sec
Finished remaining root rescan work in 2th thread: 0.032 sec
Finished remaining root rescan work in 0th thread: 0.042 sec
Finished dirty card rescan work in 2th thread: 0.029 sec
Finished dirty card rescan work in 3th thread: 0.031 sec
Finished dirty card rescan work in 0th thread: 0.019 sec
Finished dirty card rescan work in 1th thread: 0.031 sec
Finished work stealing in 3th thread: 0.000 sec
Finished work stealing in 0th thread: 0.000 sec
Finished work stealing in 2th thread: 0.000 sec
Finished work stealing in 1th thread: 0.000 sec
, 0.0959640 secs]36799.455: [weak refs processing, 0.0121220 secs] [1 CMS-remark: 1920070K(4456448K)] 2031059K(5111808K), 0.2952350 secs] [Times: user=1.11 sys=0.00, real=0.30 secs]
Total time for which application threads were stopped: 0.3066430 seconds
36799.468: [CMS-concurrent-sweep-start]
36805.824: [CMS-concurrent-sweep: 6.355/6.356 secs] (CMS-concurrent-sweep yielded 515 times)
 [Times: user=8.76 sys=0.24, real=6.35 secs]
36805.824: [CMS-concurrent-reset-start]
36805.903: [CMS-concurrent-reset: 0.079/0.079 secs] (CMS-concurrent-reset yielded 0 times)
 [Times: user=0.64 sys=0.01, real=0.07 secs]
Application time: 6.5590510 seconds
// About here the server starts sending out responses
36806.044: [GC 36806.045: [ParNew
Desired survivor size 67108864 bytes, new threshold 2 (max 15)
- age   1:   55456640 bytes,   55456640 total
- age   2:   54278984 bytes,  109735624 total
: 635276K->131072K(655360K), 0.3230430 secs] 2554507K->2089990K(5111808K) icms_dc=0 , 0.3242750 secs] [Times: user=1.02 sys=0.02, real=0.33 secs]
Total time for which application threads were stopped: 0.3422430 seconds
Application time: 0.7628480 seconds
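
To line the stalls up with GC activity from inside the JVM rather than only from gc.log, a small background thread polling the standard GarbageCollectorMXBeans could be dropped into the webapp. This is only an illustrative sketch (class name and polling interval are made up, and it is not something we currently run):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative only: periodically log cumulative GC counts and times so GC
// activity can be lined up against the timestamps of the stalled responses.
public class GcWatcher implements Runnable {
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.println(gc.getName()
                            + ": collections=" + gc.getCollectionCount()
                            + ", time=" + gc.getCollectionTime() + " ms");
                }
                Thread.sleep(5000); // poll every 5 seconds
            }
        } catch (InterruptedException ignored) {
            // stop quietly when the thread is interrupted
        }
    }

    public static void start() {
        Thread t = new Thread(new GcWatcher(), "gc-watcher");
        t.setDaemon(true);
        t.start();
    }
}

With the ParNew + CMS setup below, the beans report themselves as "ParNew" and "ConcurrentMarkSweep".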

We have tried various GC setups, but we always end up more or less like this. Our current parameters:

-Xms4096m -Xmx5120m -XX:MaxPermSize=384m
-XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent 
-XX:+PrintClassHistogram -XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails 
-XX:ReservedCodeCacheSize=64m -XX:NewSize=768m -XX:SurvivorRatio=4 
-XX:+CMSParallelRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:+CMSIncrementalMode 
-XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 
-XX:+UseTLAB -XX:ParallelGCThreads=4 
-XX:+UseParNewGC -XX:MaxGCPauseMillis=1000 -XX:GCTimeRatio=10 
-XX:+CMSPermGenSweepingEnabled 
-XX:+PrintCMSInitiationStatistics -XX:PrintCMSStatistics=1 -XX:+PrintGCTaskTimeStamps 
-XX:+PrintGCTimeStamps 
-verbose:gc -XX:+TraceClassLoading -XX:+TraceClassUnloading 
-XX:+PrintTenuringDistribution 
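
A long option list like this is easy to truncate in a wrapper script, so it can be worth confirming what the running VM actually received via the standard RuntimeMXBean. A minimal, illustrative snippet (it would have to run inside the server VM, e.g. from a test JSP, to be meaningful):

import java.lang.management.ManagementFactory;

// Illustrative sanity check: print the JVM options the running VM actually
// received, to confirm the list above is in effect.
public class ShowJvmArgs {
    public static void main(String[] args) {
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            System.out.println(arg);
        }
    }
}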

How can we avoid the unresponsiveness?

Are you running with -verbose:gc in production? Surely logging all GC activity would add to the unresponsiveness. – matt b
Yes, I am. But every blog I've read says that GC logging does not slow the system down. Even if it did, I'd guess the slowdown would be there all the time. – Morten Knudsen
Have you run a profiler to find out what objects are flooding the heap and contributing to the GC runs? – Chris Aldrich
No, we haven't. I'm a bit reluctant to do this, since I'm not sure of the impact of running a profiler on a production machine. – Morten Knudsen

2 Answers

1 vote

The GC logs include lines like this:

Total time for which application threads were stopped: ... seconds

Unless I'm mistaken, they add up to less than a second across the window you posted. That does not square with your webapp going unresponsive for any significant length of time.
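
If you want to verify that over the whole log rather than just this excerpt, a throwaway tally along these lines would do it (rough sketch; the regex assumes the -XX:+PrintGCApplicationStoppedTime output you posted, and the file name is illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: sum the safepoint pauses reported by
// -XX:+PrintGCApplicationStoppedTime in a gc.log.
public class StoppedTimeSum {
    public static void main(String[] args) throws Exception {
        Pattern p = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9.]+) seconds");
        double total = 0.0;
        int count = 0;
        BufferedReader in = new BufferedReader(
            new FileReader(args.length > 0 ? args[0] : "gc.log"));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                total += Double.parseDouble(m.group(1));
                count++;
            }
        }
        in.close();
        System.out.printf("%d stop-the-world pauses, %.3f seconds total%n", count, total);
    }
}

Run against a full day's gc.log, that gives you the total stop-the-world time to compare against the length of the stalls you are seeing.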

So I'm going to suggest that the cause may be something else. One possibility is that you are pushing the limits of your machine, and that the extra load of running the GC is causing requests to back up. For instance, with 4 GC threads on a 12-core box, I'd expect the theoretical maximum request rate to drop by roughly a third (4/12) while the collector is running, which in the log you posted is about 16 seconds of concurrent work (concurrent mark, abortable preclean and sweep together). That could lead to a significant request backlog.

A second possibility is that the machine is short of physical memory and the GC is pushing it into a mild case of virtual-memory thrashing; watching swap activity (for example with vmstat) while a CMS cycle runs would show whether that is happening.

0 votes

Adding a question as an answer, since I don't yet have the rep to comment: what JRE are you running on?