Netty EventExecutorGroup breaks pipeline

Question

The situation: I have a proxy app that uses Netty 4.0.17.Final (FYI: I already had the issue with versions 4.0.13.Final and 4.0.9.Final) and that's based on the proxy from the Netty examples.

The main difference between my code and the example is that my code doesn't connect to the backend server when the channel goes active, but only on the first read as this read must first do some checks on the input before connecting and forwarding that message to the backend server.

I have unit tested and load tested my app for hours and it works fine.

The problem: As the first message received needs to perform some blocking operation I tried to use a separate EventExecutorGroup for that one handler that does that (so that the IO threads don't get blocked):

private static final EventExecutorGroup handlersExecutor = new DefaultEventExecutorGroup(10);
...
pipeline.addLast(handlersExecutor, "authenticationHandler", new FrontendHandler(outboundAddress));

This (= the only change that I made!) breaks the application during load tests. What breaks? XXX out of 3500 client connections report me that YY out of 500 messages for those clients didn't get a reply from the proxy (each request should get one response). An excerpt from the client logs:

2014-02-14 00:39:56.146 [id: 0x34cb2c60] ERROR (com.nsn.ucpsimulator.common.UcpDecoder) - Idle connection (/127.0.0.1:7201). PDUs received: 13

2014-02-14 00:39:56.146 [id: 0xf0955993] ERROR (com.nsn.ucpsimulator.common.UcpDecoder) - Idle connection (/127.0.0.1:7201). PDUs received: 13

2014-02-14 00:39:56.147 [id: 0x9a911fa3] ERROR (com.nsn.ucpsimulator.common.UcpDecoder) - Idle connection (/127.0.0.1:7201). PDUs received: 13

2014-02-14 00:39:56.149 [id: 0x811bbadf] ERROR (com.nsn.ucpsimulator.common.UcpDecoder) - Idle connection (/127.0.0.1:7201). PDUs received: 13

2014-02-14 00:39:56.150 [id: 0x0c4d4c5a] ERROR (com.nsn.ucpsimulator.common.UcpDecoder) - Idle connection (/127.0.0.1:7201). PDUs received: 13

The proxy app tells me that 500 messages were received and were forwarded, but that only 13 replies were received and forwarded back to the client:

2014-02-14 00:39:57.683 [id: 0x39af563b] ERROR (be.demmel.fun.UcpDecoder) - Idle connection (/127.0.0.1:49359). PDUs received: 500

2014-02-14 00:39:57.683 [id: 0x82056d39] ERROR (be.demmel.fun.FrontendHandler) - Idle connection (/127.0.0.1:52004), closing it. PDUs forwarded: 500. Success: 500

2014-02-14 00:40:00.717 [id: 0xcdca8f66] ERROR (be.demmel.fun.UcpDecoder) - Idle connection (/127.0.0.1:7900). PDUs received: 13

2014-02-14 00:40:00.718 [id: 0xcdca8f66] ERROR (be.demmel.fun.BackendHandler) - Idle connection (/127.0.0.1:7900). PDUs forwarded: 13. Success: 13

The server tells me that all is well:

2014-02-14 00:40:02.855 [id: 0x4980be2c] ERROR (com.nsn.ucpsimulator.common.UcpDecoder) - Idle connection (/127.0.0.1:37944). PDUs received: 500

2014-02-14 00:40:02.856 [id: 0x4980be2c] ERROR (com.nsn.ucpsimulator.server.TestUcpHandler) - Idle connection (/127.0.0.1:37944). PDUs sent back: 500

Does somebody know what could cause this?

Additional info:

note that everything works fine until I start using a separate EventExecutorGroup for the blocking handler.
every time XX clients block, they all block at the same amount of replies forwarded to the client.
I have uploaded the netty code here (it's runnable, contains the proxy, server and client apps with a README): https://github.com/AndrewBourgeois/ucp-proxy/tree/master/src/main/java/be/demmel/fun
When the proxy app gets killed, this error pops up at the server side:

java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_45]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_45]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_45]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_45]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_45]
    at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:401) ~[netty-all-4.0.9.Final.jar:na]
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:869) ~[netty-all-4.0.9.Final.jar:na]
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208) ~[netty-all-4.0.9.Final.jar:na]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:87) ~[netty-all-4.0.9.Final.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:478) ~[netty-all-4.0.9.Final.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:447) ~[netty-all-4.0.9.Final.jar:na]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:341) ~[netty-all-4.0.9.Final.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) [netty-all-4.0.9.Final.jar:na]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]

I believe that this error indicates that my Netty handlers aren't processing the server replies.

Derek Troy-West Derek Troy-West · Accepted Answer · 2014-02-15T00:03:23

Having a peek at your github project, your execution looks a little like:

--> serve request
  --> authenticate (blocking db call)
    --> forward request
    <-- receive response
<-- serve response

Without a separate EventExecutorGroup all of your execution runs within the NioEventLoopGroup which is supposed to be used only for non-blocking actions. Each request served would decode, and then immediately block on the DB call, so your server is effectively rate-limited to the number of threads in the NioEventLoopGroup.

You have added a DefaultEventExecutorGroup around the ChannelHandler which does the authentication, so now serving requests and authenticating are partly decoupled because each request will be decoded and then execution will be passed to the DEEG, leaving the NioEventLoopGroup to decode more requests.

Except your bootstrap which connects to the DB is configured to use the same NioEventLoopGroup as the initial channel:

b.group(inboundChannel.eventLoop())

Which means that you're still blocking your main netty worker threads with your blocking DB connection.

I'm not sure what happens after that point, but perhaps your serving a bunch of requests (effectively queueing them all up waiting on the DEEG to be available) and then timing them out because they're all waiting on the blocking DB call (which is having it's ability to execute be starved because it's in contention with server decode stuff).

i.e. (Assuming you have plenty of concurrent clients)

[original, 2 thread NioEventLoopGroup, no EventExecutorGroup]

nio-thread-1: serve-request 1 and authenticate (block)
nio-thread-2: serve-request 2 and authenticate (block)

(db calls completes)

nio-thread-1: forward-request 1 (non-blocking)
nio-thread-2: forward-request 2 (non-blocking)

nio-thread-1: serve-request 3 and authenticate (block)
nio-thread-2: serve-request 4 and authenticate (block)

(db calls complete)

nio-thread-1: forward-request 3 (non-blocking)
nio-thread-2: forward-request 4 (non-blocking)

nio-thread-1: either serve-response 1/2 or serve-request 5 (and block)
nio-thread-2: either serve-response 1/2 or serve-request 6 (and block)

It's not pretty, but you're limited to processing roughly n*2 requests at a time, assuming server-requests and server-responses are handled with the same urgency.

[2 thread NioEventLoopGroup, 2 thread DefaultEventExecutorGroup]

nio-thread-1: serve-request 1 and pass to DEEG
nio-thread-2: serve-request 2 and pass to DEEG
nio-thread-1: serve-request 3 and pass to DEEG
nio-thread-2: serve-request 4 and pass to DEEG
nio-thread-1: serve-request 5 and pass to DEEG
nio-thread-2: serve-request 6 and pass to DEEG
nio-thread-1: serve-request 7 and pass to DEEG
nio-thread-2: serve-request 8 and pass to DEEG

def-evt-eg-1: try to authenticate, pass execution back to nio-thread-x
def-evt-eg-2: try to authenticate, pass execution back to nio-thread-x

nio-thread-1: serve-request 9 and pass to DEEG
nio-thread-2: serve-request 10 and pass to DEEG
nio-thread-1: serve-request 11 and pass to DEEG
nio-thread-2: serve-request 12 and pass to DEEG
nio-thread-1: authenticate against DB (block)
nio-thread-2: serve-request 12 and pass to DEEG
nio-thread-2: serve-request 13 and pass to DEEG
nio-thread-2: serve-request 14 and pass to DEEG
nio-thread-2: serve-request 15 and pass to DEEG
nio-thread-2: authenticate against DB (block)

Now you're able to consume more requests, but the rate at which you make DB calls and total latency through your server will depend on the number of concurrent clients you have, the number of DEEG threads v number of NioEventLoop threads, context switching, etc.

You can probably visualize this by printing out some basic thread diagnostics as you run your app. I might be completely wrong as I didn't have a chance to run it and see for myself, this is just my guess.

Netty EventExecutorGroup breaks pipeline

2 Answers