This is a long running server app which creates a lot of short lived garbage and almost nothing tenured after startup. Most of the time the young gen collection goes quickly, even with 10 gb it's fast since it's almost all garbage, but we are seeing occasional brutal outliers. As a potential hint, we are running a similarly configured server at lower load on a machine with more CPU resources but 12 gb less memory. We are not seeing this pattern there.
java -Xms20g -Xmx20g -Xloggc:"./logs/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:NewRatio=1 -XX:+DisableExplicitGC -XX:+UseSpinning -XX:PreBlockSpin=1000 -XX:-UseCounterDecay -XX:+UseTLAB -XX:+TieredCompilation -XX:ReservedCodeCacheSize=512m
Here is a snippet from the GC logs. Is it safe to say that it's spending the full time actually doing GC work (instead of waiting for threads to reach safepoints, for example) since sys time is high as well?
2013-12-13T14:04:50.388+0000: 56748.603: [GC [PSYoungGen: 10363808K->1248K(10421824K)] 10569323K->207003K(20907584K), 0.0035110 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
2013-12-13T14:17:32.245+0000: 57510.460: [GC [PSYoungGen: 10363808K->1600K(10435072K)] 10569563K->207563K(20920832K), 0.0038460 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-12-13T14:29:08.572+0000: 58206.787: [GC [PSYoungGen: 10381824K->1408K(10430912K)] 10587787K->207875K(20916672K), 0.0045710 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2013-12-13T14:36:25.176+0000: 58643.391: [GC [PSYoungGen: 10381632K->1600K(10442432K)] 10588099K->208371K(20928192K), 0.0040800 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
2013-12-13T14:44:04.409+0000: 59102.624: [GC [PSYoungGen: 10397120K->1600K(10438848K)] 10603891K->208715K(20924608K), 14.7387710 secs] [Times: user=0.00 sys=143.05, real=14.74 secs]
2013-12-13T14:52:50.287+0000: 59628.502: [GC [PSYoungGen: 10397120K->11488K(10446720K)] 10604235K->218819K(20932480K), 0.0127280 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
2013-12-13T14:59:34.934+0000: 60033.149: [GC [PSYoungGen: 10418208K->1536K(10445760K)] 10625539K->218931K(20931520K), 211.1968100 secs] [Times: user=0.00 sys=2067.08, real=211.16 secs]
2013-12-13T15:12:04.478+0000: 60782.693: [GC [PSYoungGen: 10408256K->15136K(10448064K)] 10625651K->232723K(20933824K), 0.0147670 secs] [Times: user=0.00 sys=0.12, real=0.01 secs]
2013-12-13T15:23:02.123+0000: 61440.338: [GC [PSYoungGen: 10426912K->12672K(10448768K)] 10644499K->232635K(20934528K), 0.0141040 secs] [Times: user=0.00 sys=0.12, real=0.02 secs]
2013-12-13T15:31:09.119+0000: 61927.334: [GC [PSYoungGen: 10424448K->5280K(10452160K)] 10644411K->232907K(20937920K), 0.8134440 secs] [Times: user=1.22 sys=6.30, real=0.82 secs]
2013-12-13T15:39:25.083+0000: 62423.298: [GC [PSYoungGen: 10421024K->6208K(10449344K)] 10648651K->235547K(20935104K), 0.0087890 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
2013-12-13T15:47:52.781+0000: 62930.996: [GC [PSYoungGen: 10421952K->7200K(10456960K)] 10651291K->239123K(20942720K), 0.0094460 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
2013-12-13T15:58:18.594+0000: 63556.809: [GC [PSYoungGen: 10433248K->6752K(10454848K)] 10665171K->242331K(20940608K), 0.0091570 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
Specs on the server above are dual 4-core X5570 with 48 gb of ram. HTOP is showing 15 gb used, the rest cached with the app using a little over 10 gb at any time (tenured is small). This makes me think it's not a paging issue
The other server I refer to is a dual 8-core E5-2690 with 32 gb of ram. Both are running jre 1.7.0_25 on fedora, although the server above is on a very old version