Our backend server is using Embedded Jetty 8.1.15 for few years now. Did not have any issues with it until recently, when we started doing load test scenario with a lot of concurrent users. Then we succeeded reproducing the issue even with small amount of users using JMeter (HTTP Sampler with concurrent pool of 1000 and KeepAlive). The communication between client and server is over TLS (with connection pooling on the client side) Client (connection pooling) -> TLS -> Server The behavior we see - at some point of time, a lot of threads are "stuck" (waiting on monitor on different methods) with the following stacktraces -
Thread "qtp438123546-99":
at java.security.SecureRandom.nextBytes(byte[ ]) (line: 457)
at sun.security.ssl.RandomCookie.<init>(java.security.SecureRandom) (line: 53)
at sun.security.ssl.ServerHandshaker.clientHello(sun.security.ssl.HandshakeMessage$ClientHello) (line: 522)
at sun.security.ssl.ServerHandshaker.processMessage(byte, int) (line: 213)
at sun.security.ssl.Handshaker.processLoop() (line: 925)
at sun.security.ssl.Handshaker$1.run() (line: 865)
at sun.security.ssl.Handshaker$1.run() (line: 862)
at java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext)
at sun.security.ssl.Handshaker$DelegatedTask.run() (line: 1302)
at org.eclipse.jetty.io.nio.SslConnection.process(org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 375)
at org.eclipse.jetty.io.nio.SslConnection.access$900(org.eclipse.jetty.io.nio.SslConnection, org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 48)
at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.fill(org.eclipse.jetty.io.Buffer) (line: 678)
at org.eclipse.jetty.http.HttpParser.fill() (line: 1044)
at org.eclipse.jetty.http.HttpParser.parseNext() (line: 280)
at org.eclipse.jetty.http.HttpParser.parseAvailable() (line: 235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle() (line: 82)
at org.eclipse.jetty.io.nio.SslConnection.handle() (line: 196)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle() (line: 696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run() (line: 53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(java.lang.Runnable) (line: 608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run() (line: 543)
at java.lang.Thread.run() (line: 745)
OR sometimes on other Java Security synchronized API methods
Thread "qtp438123546-993":
at sun.security.ssl.SignatureAndHashAlgorithm.getSupportedAlgorithms(java.security.AlgorithmConstraints) (line: 155)
at sun.security.ssl.Handshaker.getLocalSupportedSignAlgs() (line: 422)
at sun.security.ssl.ServerHandshaker.clientHello(sun.security.ssl.HandshakeMessage$ClientHello) (line: 700)
at sun.security.ssl.ServerHandshaker.processMessage(byte, int) (line: 213)
at sun.security.ssl.Handshaker.processLoop() (line: 925)
at sun.security.ssl.Handshaker$1.run() (line: 865)
at sun.security.ssl.Handshaker$1.run() (line: 862)
at java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext)
at sun.security.ssl.Handshaker$DelegatedTask.run() (line: 1302)
at org.eclipse.jetty.io.nio.SslConnection.process(org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 375)
at org.eclipse.jetty.io.nio.SslConnection.access$900(org.eclipse.jetty.io.nio.SslConnection, org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 48)
at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.fill(org.eclipse.jetty.io.Buffer) (line: 678)
at org.eclipse.jetty.http.HttpParser.fill() (line: 1044)
at org.eclipse.jetty.http.HttpParser.parseNext() (line: 280)
at org.eclipse.jetty.http.HttpParser.parseAvailable() (line: 235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle() (line: 82)
at org.eclipse.jetty.io.nio.SslConnection.handle() (line: 196)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle() (line: 696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run() (line: 53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(java.lang.Runnable) (line: 608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run() (line: 543)
at java.lang.Thread.run() (line: 745)
When everything is fine, before the problem appears, there is a connection pooling with between client and server with persistent connections (that can be seen in netstat), but when the problem appears there are lot of connections in different states, other that ESTABLISHED:
On the client side:
tcp6 0 0 client-host:39962 server-host:443 FIN_WAIT2
tcp6 0 1 client-host:45266 server-host:443 SYN_SENT
tcp6 0 0 client-host:46234 server-host:443 FIN_WAIT2
tcp6 0 0 client-host:38892 server-host:443 FIN_WAIT2
tcp6 0 0 client-host:39160 server-host:443 FIN_WAIT2
tcp6 0 0 client-host:39188 server-host:443 FIN_WAIT2
tcp6 0 1 client-host:43496 server-host:443 SYN_SENT
tcp6 0 0 client-host:46122 server-host:443 FIN_WAIT2
tcp6 0 1 client-host:44938 server-host:443 SYN_SENT
tcp6 0 0 client-host:46446 server-host:443 ESTABLISHED
On the server side:
tcp 0 2980 server-host:443 client-host-1:34964 LAST_ACK
tcp 0 2980 server-host:443 client-host-3:52430 LAST_ACK
tcp 0 2980 server-host:443 client-host-1:35922 LAST_ACK
tcp 0 0 server-host:443 client-host-1:38362 CLOSE_WAIT
tcp 236 0 server-host:443 client-host-3:58296 CLOSE_WAIT
tcp 0 2980 server-host:443 client-host-1:34980 LAST_ACK
tcp 0 2980 server-host:443 client-host-2:55748 LAST_ACK
tcp 0 2980 server-host:443 client-host-3:53376 LAST_ACK
tcp 0 0 server-host:443 client-host-1:40104 SYN_RECV
tcp 0 0 server-host:443 client-host-1:38718 CLOSE_WAIT
tcp 0 2980 server-host:443 client-host-2:54142 LAST_ACK
tcp 0 2980 server-host:443 client-host-1:50766 LAST_ACK
tcp 0 0 server-host:443 client-host-1:38604 CLOSE_WAIT
tcp 236 0 server-host:443 client-host-3:57604 CLOSE_WAIT
tcp 0 2980 server-host:443 client-host-2:55502 LAST_ACK
tcp 0 2980 server-host:443 client-host-2:58254 LAST_ACK
tcp 0 2980 server-host:443 client-host-1:38042 LAST_ACK
tcp 0 0 server-host:443 client-host-1:38222 CLOSE_WAIT
tcp 0 2980 server-host:443 client-host-3:47812 LAST_ACK
tcp 0 2980 server-host:443 client-host-1:60532 LAST_ACK
tcp 0 2980 server-host:443 client-host-2:54282 LAST_ACK
tcp 0 0 server-host:443 client-host-1:40978 SYN_RECV
Almost all the threads are unresponsive, CPU is very high and GC is consistently working
We also set the following flag into the JVM:
-Djava.security.egd=file:/dev/./urandom
In order the SecureRandom will be non-blocking (as opposed to /dev/random)
java version "1.8.0_05" Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
Kernel: 4.14.94-89.73.amzn2.x86_64 (but the problem is also appears on system with Kernel 2.6.32-696.20.1.el6.x86_64)
limits:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 151551
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 16384
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
When a component enters that state, it is unresponsive, the logs are not being written to the log file (btw, we use log4j2)
When a load on a component is stopped, it takes few minutes for component to recover and it becomes responsive again
Have anyone had similar behaviors in your Java Backend Components? Please comment/suggest a direction for investigation and/or solution