24
votes

I am running into a very peculiar issue. My Tomcat normally runs at about 25% CPU around the clock, but on some days the CPU shoots up to 60%, the system grinds to a halt, and it fails to recover.

When I take a thread dump during the slowdown, almost all the threads are busy with some kind of String or related operation.

There are no OutOfMemoryErrors or other exceptions being thrown, and all requests are still handled, but response times deteriorate dramatically: even a sub-second request slows down to 60 seconds or more.

My server config is as follows:

    Ubuntu 12.04.2 LTS
    Linux 3.2.0-38-virtual #60-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
    java version "1.7.0_13"
    Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
    Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
    export JAVA_OPTS='-server
    -Xms18g -Xmx18g
    -XX:MaxPermSize=512m
    -XX:ThreadStackSize=512
    -XX:NewRatio=1
    -XX:SurvivorRatio=4
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC
    -XX:+CMSClassUnloadingEnabled
    -Xloggc:/usr/tomcat/logs/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintTenuringDistribution
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=9999
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
    -Djava.awt.headless=true'

Click here to download thread dump. I have removed the bulk of the threads and their stackTraces

Click here to download vmstat log

Click here to download gc log

Any ideas as to the cause of this? Thanks

Do you get more requests at those times? Do you have network graphs to compare to? - Brendan Long
Any chance this is stop-the-world GC? Do you log GC events? - mindas
Maybe your app is happy running with its usual set of characters, and then once in a blue moon an unusual character shows up that somehow disturbs the VM's optimization. You could test this theory by feeding it different characters and seeing whether any particular character triggers the problem. If so, it would be a VM bug. - irreputable
Most threads are executing String.toUpperCase(). Unfortunately the stack traces do not show what code calls this method. Did you truncate the stack traces? - gogognome
Regarding "I have removed the bulk of the threads and their stackTraces," just give us everything. Otherwise you might delete an important detail. If you wish to hide some information, then abbreviate the package or rename the Class.method(). - jtoberon

9 Answers

3
votes

To try to pinpoint the offending requests, you can configure the Stuck Thread Detection Valve in Tomcat.

This valve detects requests that take a long time to process, which might indicate that the threads processing them are stuck.

When such a request is detected, the current stack trace of its thread is written to the Tomcat log at WARN level.

The IDs and names of the stuck threads are available through JMX in the stuckThreadIds and stuckThreadNames attributes. The IDs can be used with the standard Threading JVM MBean (java.lang:type=Threading) to retrieve other information about each stuck thread.
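
The valve itself is declared in Tomcat's configuration (server.xml or context.xml) using the class org.apache.catalina.valves.StuckThreadDetectionValve with a threshold given in seconds; check the Tomcat valve documentation for the exact attributes of your version. As a rough, illustrative sketch of the JMX part: once you have IDs from the stuckThreadIds attribute, you can pull full stack traces through the standard Threading MBean over the JMX port you already expose (9999). The host name, class name, and command-line handling below are assumptions:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: dumps the stack of each thread ID passed on the command line,
// e.g. the IDs reported by the valve's stuckThreadIds JMX attribute.
public class StuckThreadInspector {

    public static void main(String[] args) throws Exception {
        // Connect to the JMX port already exposed via -Dcom.sun.management.jmxremote.port=9999
        // (replace localhost with the real host name).
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // Proxy for the standard Threading MBean (java.lang:type=Threading).
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            for (String arg : args) {
                // Fetch up to 50 stack frames for the given thread ID.
                ThreadInfo info = threads.getThreadInfo(Long.parseLong(arg), 50);
                if (info == null) {
                    continue; // thread has already finished
                }
                System.out.println(info.getThreadName() + " (" + info.getThreadState() + ")");
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        } finally {
            connector.close();
        }
    }
}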

3
votes

If the CPU utilization is below 100% and yet the app has ground to a halt, this implies that something is preventing the CPU from being fully utilized.

I/O or excessive context switching (e.g. caused by locks) are the usual culprits for this.

Can you post the output of vmstat 1 during one of these events? The next step in the diagnosis is to work out whether context switching is the issue here (in vmstat, the cs column shows context switches per second and wa shows the time spent waiting on I/O).

3
votes

This is not a memory issue: in your dump the GC isn't even busy and there is enough memory available. In addition, the CPU is stuck at 60%, but if the application were busy calculating (GC or anything else) it would be stuck at 100%, and the same would be true for a network attack. So the source of this problem most likely involves some disk I/O operation.

Tomcat is known to be buggy and has several severe issues. One thing I encountered is that, for no particular reason, Tomcat suddenly flooded its own log file with nonsense entries. This not only caused the disk to fill up to 100% but also dramatically slowed down incoming requests. You can check this by looking at the Tomcat logs and their size.

If this is not the source, you should use the available tools to check for any unusual disk I/O from Tomcat and proceed from there.

3
votes

I think your problem is this configuration decision: -XX:PermSize=320m -XX:MaxPermSize=320m. By not allowing your PermSpace to grow dynamically, you can end up in a deadlock when you exhaust it - remember that the intern cache uses PermSpace. I would try changing -XX:MaxPermSize=320m to -XX:MaxPermSize=512m.

3
votes

Try increasing the maximum size of the Code Cache with the following JVM option:

-XX:ReservedCodeCacheSize=256m

See my answer to another question for the background to this suggestion.

2
votes

Are there any anomalies in the GC log? It seems like you are running a pretty big heap with some unusual options and doing a lot of String-allocating work. Maybe you suffer from heap fragmentation over time (CMS does not compact; look for "promotion failed" or "concurrent mode failure" entries in the GC log). Also ensure no swapping is going on (which may happen if the heap is so large that parts of it are rarely visited by the VM).

I'd suspect this to be GC related, as apparently no thread is blocked. Have you tried a more recent JDK? You might also retry after removing the somewhat uncommon option -XX:+CMSScavengeBeforeRemark, because such options may not get much test coverage with each minor JDK release.

Another suspicion is incoming requests using unusual charsets (Cyrillic or Arabic) leading to massive charset-mapping overhead. Also check whether there is some robot hitting your page, or any suspicious requests coming in. You definitely need longer stack traces to find out the root operation of the String manipulation.

1
votes

You can use BTrace to diagnose which code is doing the method invocations.

Write a BTrace script like this:

It traces classes under the com.xx.xx package prefix that invoke any method of String, and periodically prints the invocation counts.

import com.sun.btrace.BTraceUtils;
import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.OnMethod;
import com.sun.btrace.annotations.OnTimer;

import java.util.Map;

@BTrace
public class StringTrace {

    // Shared (not @TLS/thread-local) map, so the @OnTimer handler below,
    // which runs on a different thread, sees the same counts.
    private static Map<String, Integer> countMap = BTraceUtils.newHashMap();

    // Package prefix (like com.xx.xx) that you want to trace.
    private static String prefix = "com.xx.xx";

    // Fired on every method of java.lang.String; note this instruments
    // all String methods and therefore adds noticeable overhead.
    @OnMethod(clazz = "java.lang.String", method = "/.*/")
    public static void traceMethodInvoke() {
        String str = BTraceUtils.jstackStr();
        for (String currentClass : str.split("\\n")) {
            if (BTraceUtils.Strings.startsWith(currentClass, prefix)) {
                // Count only the nearest calling frame from our own packages.
                if (!countMap.containsKey(currentClass)) {
                    countMap.put(currentClass, 1);
                } else {
                    countMap.put(currentClass, countMap.get(currentClass) + 1);
                }
                break;
            }
        }
    }

    // Print the counters every 5 seconds.
    @OnTimer(5000)
    public static void print() {
        BTraceUtils.println("========================================");
        for (Map.Entry<String, Integer> entry : countMap.entrySet()) {
            if (entry.getValue() > 100) { // only report callers seen more than 100 times
                BTraceUtils.println(entry.getValue() + "\t\t" + entry.getKey());
            }
        }
        BTraceUtils.println("========================================");
    }
}
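
Assuming the class above is saved as, say, StringTrace.java (the name is just an example), it can be attached to the running Tomcat process with the BTrace client, e.g. btrace <tomcat-pid> StringTrace.java. Because the script uses loops and direct method calls, it may need to be run in BTrace's unsafe/trusted mode.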

The output looks like this:

====================================================
1022                           com.xx.xx.classA#m1
322                            com.xx.xx.classA#m2
2022                           com.xx.xx.classA#m21
422                            com.xx.xx.ccc.classX#m11
522                            com.xx.xx.zz.classS#m44
.........

You can change the prefix to trace a different package.

Based on the results, you can analyze the source code and find out where the problem is.

1
votes

Having scanned through the thread dump looking at the RUNNABLE threads, one thing stands out. Your system seems to be processing / trying to process a large number of requests at the same time. And unless you have a large number of cores, there is likely to be a lot of time slicing. On the other hand, I can see no clear >>evidence<< that this is GC related. (But you didn't include the GC logs ...)

I suggest that you look at two things.

  • Look at the operating system's virtual memory statistics. One possible cause of catastrophic system slowdown is virtual memory thrashing. This is where the total application demand for virtual memory pages exceeds the physical memory available ... and the operating system spends a lot of time swapping pages between physical memory and swap disc / page file.
  • Look at the pattern of requests you are getting. It could be that at certain times the number / type of requests you are getting simply exceeds your system's capacity.

If the problem is VM thrashing, then the solution is to reduce application memory demand. The simple way to do this is to reduce the Java heap size.

If the problem is load, then that is harder to solve:

  • You could try beefing up your hardware (or adding more vCPUs to your virtual machine).
  • You could try spreading the load over multiple instances of your server.
  • You could try reducing the number of worker threads so that your server doesn't try to handle so many requests at a time.
  • You could try profiling the application and/or analysing the request stats to see if there are hotspots that could be tuned, or expensive functionality that could be turned off ...

Finally, you could see if it helps to switch from CMS to the Parallel Collector; refer to the Oracle GC Tuning page: Available Collectors. But I doubt that this is a GC problem.
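
For reference, a minimal sketch of that change to the JAVA_OPTS above would be to remove -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC and enable the throughput (parallel) collector instead; verify the exact flags against the GC tuning guide for your JDK:

    -XX:+UseParallelGC
    -XX:+UseParallelOldGC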

0
votes

The first thing you should do is find out which threads are actually consuming the CPU. It could be the threads performing the String operations, or it could be another VM thread performing GC and sweep operations. The link explains how to correlate the CPU spikes with the thread dump.

Once you can pinpoint the threads, it will be much clearer what the next step should be.

Hope this helps
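
A common low-tech way to do this on Linux is to take the per-thread CPU usage from top -H -p <tomcat-pid>, convert the hot thread IDs to hex, and match them against the nid=0x... values in the thread dump. Since JMX is already enabled on port 9999, another option is to ask the JVM itself for per-thread CPU time via the standard ThreadMXBean. Below is a minimal, illustrative sketch (class name and output format are my own); it runs inside the target JVM, but could be adapted to connect over JMX as in the earlier example:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: prints the threads of the current JVM ordered by consumed CPU time.
public class CpuHogs {

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }

        // Sort threads by CPU time, highest first.
        // Note: getThreadCpuTime() is cumulative since the thread started;
        // to see who is hot *right now*, sample twice and diff the values.
        Map<Long, Long> cpuByThread = new TreeMap<Long, Long>(Collections.reverseOrder());
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id);
            if (cpuNanos > 0) {
                cpuByThread.put(cpuNanos, id);
            }
        }

        for (Map.Entry<Long, Long> entry : cpuByThread.entrySet()) {
            ThreadInfo info = mx.getThreadInfo(entry.getValue());
            if (info != null) {
                System.out.printf("%8d ms  %s (%s)%n",
                        entry.getKey() / 1000000, info.getThreadName(), info.getThreadState());
            }
        }
    }
}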