0
votes

Question Description

Our Java application runs in Docker, using Java 8, with 6 CPU cores and 8 GB of memory.

The garbage collectors are ParNew + CMS, but old gen GC is almost never triggered.

Frameworks: Spring + MyBatis + Dubbo + RocketMQ.

Here are the JVM params:

-server -Xmx5g -Xms5g -Xmn1g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -Xss256k -XX:SurvivorRatio=8 -XX:+PrintGCDetails -Xloggc:/opt/apps/logs/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintReferenceGC -XX:+UnlockDiagnosticVMOptions -XX:-DisplayVMOutput -XX:+LogVMOutput -XX:LogFile=/opt/apps/logs/safepoint.log -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/apps/logs

I spent two weeks on this problem but could not find out what caused it. First, I enabled the GC log and the safepoint log, suspecting that safepoints might be the cause, but I found that the spin time + block time was very short. Only when a young GC happened was the vmop time almost the same as the young GC stop time.

Then I checked the FinalReference processing by adding -XX:+PrintReferenceGC. It was very short too.

Here is the GC log:

2019-09-20T03:03:37.816+0800: 23271.179: [GC (Allocation Failure) 2019-09-20T03:03:37.818+0800: 23271.180: [ParNew2019-09-20T03:03:39.394+0800: 23272.756: [SoftReference, 0 refs, 0.0000626 secs]2019-09-20T03:03:39.394+0800: 23272.756: [WeakReference, 92 refs, 0.0000400 secs]2019-09-20T03:03:39.394+0800: 23272.756: [FinalReference, 55 refs, 0.0001404 secs]2019-09-20T03:03:39.394+0800: 23272.756: [PhantomReference, 0 refs, 0 refs, 0.0000298 secs]2019-09-20T03:03:39.394+0800: 23272.756: [JNI Weak Reference, 0.0000552 secs]: 845514K->6643K(943744K), 1.5767273 secs] 1160791K->322039K(5138048K), 1.5784758 secs] [Times: user=5.21 sys=0.17, real=1.57 secs] 2019-09-20T03:03:39.396+0800: 23272.758: Total time for which application threads were stopped: 1.5826799 seconds, Stopping threads took: 0.0005805 seconds

Here is the safepoint log:

vmop                                 [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
23271.176: GenCollectForAllocation   [ 1699 0 2 ] [ 0 0 0 2 1578 ] 0
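
(The vmop column shows 1578 ms while spin, block and sync are all ~0 ms; this is essentially the 1.58 s young GC pause from the log above, so the stop time is the collection itself rather than safepoint synchronization.)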

Can you upload a complete GC log somewhere? Have you tried G1GC? – the8472
You can preview it at oms.zb2088.com/static/gc.txt. We have no plan to try G1 GC in the production environment. – 张国彬
Does the container run on bare metal or on a virtual host? A resource-starved virtualization host can create 'phantom' CPU usage on the guest OS. – Alexey Ragozin
It runs in Docker. I am going to look into the Docker side and collect some system monitoring data. – 张国彬
The problem was solved. We found that the JVM in Docker got the wrong CPU core count: the number came from the bare metal host, not from the Docker config. I added -XX:ParallelGCThreads=6 to control the number of ParNew GC threads. GC time dropped to 50 ms and stayed stable. – 张国彬

2 Answers

1
votes

As can be seen from the remaining heap occupancy after a young collection, the live set size (which includes young and old objects) of your application seems to be roughly 300 MB. Your total heap size is 5 GB, so there is lots of breathing room.

Since there is breathing room and the young pause time is a concern, you could try shrinking the young generation. This may slightly increase the tenuring rate, but the CMS old gen collector should be able to keep up with that.
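
For instance (illustrative value only, not something tested against this workload), changing -Xmn1g to -Xmn512m while leaving the other flags alone would be one way to try this.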

From the GC logs you provided:

2019-09-20T07:56:55.968+0800: 40869.330: [GC (Allocation Failure) 2019-09-20T07:56:55.969+0800: 40869.331: [ParNew2019-09-20T07:56:56.006+0800: 40869.368: [SoftReference, 0 refs, 0.0000588 secs]2019-09-20T07:56:56.006+0800: 40869.368: [WeakReference, 114 refs, 0.0000369 secs]2019-09-20T07:56:56.006+0800: 40869.368: [FinalReference, 22 refs, 0.0000404 secs]2019-09-20T07:56:56.006+0800: 40869.368: [PhantomReference, 0 refs, 0 refs, 0.0000298 secs]2019-09-20T07:56:56.006+0800: 40869.368: [JNI Weak Reference, 0.0000626 secs]: 842095K->3179K(943744K), 0.0378130 secs] 1170641K->331912K(5138048K), 0.0394628 secs] [Times: user=0.70 sys=0.42, real=0.04 secs] 
2019-09-20T07:58:44.545+0800: 40977.907: [GC (Allocation Failure) 2019-09-20T07:58:44.545+0800: 40977.908: [ParNew2019-09-20T07:58:45.798+0800: 40979.160: [SoftReference, 0 refs, 0.0000641 secs]2019-09-20T07:58:45.798+0800: 40979.160: [WeakReference, 116 refs, 0.0000436 secs]2019-09-20T07:58:45.798+0800: 40979.160: [FinalReference, 14 refs, 0.0000217 secs]2019-09-20T07:58:45.798+0800: 40979.160: [PhantomReference, 0 refs, 0 refs, 0.0000293 secs]2019-09-20T07:58:45.798+0800: 40979.160: [JNI Weak Reference, 0.0000510 secs]: 842091K->3159K(943744K), 1.2526309 secs] 1170824K->331976K(5138048K), 1.2541209 secs] [Times: user=6.73 sys=0.66, real=1.26 secs] 

These two collections do a similar amount of work, and yet the second one takes about 10x the CPU cycles and 30x the wall time. This usually indicates a cause outside the JVM, i.e. other processes or (as Alexey suggested) sibling VMs contending for CPU or memory bandwidth. CPU thermal throttling may also be an issue, but that's less common in server environments.
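
Concretely, from the [Times: ...] fields above: user 6.73 s vs 0.70 s is roughly 10x the CPU time, and real 1.26 s vs 0.04 s is roughly 30x the wall time.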

0
votes

The problem was solved. We found out that the JVM in Docker got the wrong CPU core count: the number came from the bare metal host rather than from the Docker config. I added -XX:ParallelGCThreads=6 to control the number of ParNew GC threads. GC time dropped to about 50 ms and stayed stable.
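
For anyone hitting the same issue, a quick way to check what the JVM actually sees inside the container is to print its reported processor count. The snippet below is only a minimal sketch (the class name is illustrative, and the printed value approximates HotSpot's usual ParallelGCThreads default of N for N <= 8 cores and 8 + (N - 8) * 5/8 above that):

    public class CpuCountCheck {
        public static void main(String[] args) {
            // What the JVM thinks the machine has; on Java 8 builds without
            // container support this is usually the bare-metal core count,
            // not the Docker CPU limit.
            int cpus = Runtime.getRuntime().availableProcessors();
            System.out.println("availableProcessors = " + cpus);

            // Approximate HotSpot default for ParallelGCThreads:
            // N for N <= 8 cores, otherwise 8 + (N - 8) * 5 / 8.
            int defaultGcThreads = cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
            System.out.println("approx default ParallelGCThreads = " + defaultGcThreads);
        }
    }

If the printed count matches the host rather than the container's 6 cores, capping the threads with -XX:ParallelGCThreads (as above) or moving to a JDK build with container support (e.g. -XX:ActiveProcessorCount=6 on newer Java 8 updates) are the usual fixes.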