I am new to Netty, and it seems pretty heavyweight to me, so I decided to do some research on its performance. I was trying to implement a server that does the following:

  1. Receive UDP datagrams from some sources (many UDP ports).
  2. Listen for TCP connections.
  3. Send the received datagrams to the connected TCP clients.

The number of clients is pretty small (I used 10 in my test), but the UDP streams are pretty heavy: 50 streams of around 200 kB/s each, which adds up to about 10 MB/s. I imitated those streams with a single-threaded application that sends 200 packets of 1440 bytes each (4 packets to each port), then sleeps for 28 ms, and so on (which in practice gave me only about 8000 kB/s, I guess because of the high load and inaccurate sleep times).
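
For reference, a simplified sketch of the imitator's send loop (packetFor() is just a placeholder for building the datagram, whose layout is described below):

while (true) {
  // one burst: 4 packets to each of the 50 ports = 200 packets
  // of 1440 bytes, i.e. 288000 bytes per iteration
  for (int port = 18000; port < 18050; ++port) {
    for (int stream = 0; stream < 4; ++stream) {
      socket.send(packetFor(port, stream)); // one 1440-byte UDP datagram
    }
  }
  Thread.sleep(28); // 288000 B / 28 ms ≈ 10 MB/s nominal; real sleeps run longer
}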

Now, I realize that this isn't a terribly high load, but my PC is pretty slow too: an old 2-core Intel E4600 running Windows 7 x64.

I start three programs: the sender (imitator), the server and the client, all on the same machine. I guess that isn't the best way to test, but at least it should let me compare how different server implementations perform against the same imitator and client.

The packet structure looks like this: an 8-byte timestamp, an 8-byte packet number (starting at 0), a 1-byte port identifier and a 1-byte "substream" identifier. The idea is that each of the 50 ports carries 4 substreams, so I actually have 200 independent streams of packets grouped into 50 UDP streams.
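
In code, the header is written roughly like this (a sketch; packetNumber, portId and subStream are illustrative names):

// header layout, matching the getLong(8)/getByte(16)/getByte(17)
// reads in the server's UDPHandler below
ByteBuffer buf = ByteBuffer.wrap(new byte[1440]);
buf.putLong(System.currentTimeMillis()); // offset 0: 8-byte timestamp
buf.putLong(packetNumber);               // offset 8: 8-byte packet number (from 0)
buf.put((byte) portId);                  // offset 16: port identifier (0-49)
buf.put((byte) subStream);               // offset 17: substream identifier (0-3)
// the remaining 1422 bytes are just padding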

The results were somewhat surprising. With a plain old thread-per-client server I got around 7500 kB/s of throughput with very little packet loss. It was actually two threads per client (one of them blocked on read() just in case a client sends something, which it doesn't) plus 50 threads for UDP receiving. The CPU load was around 60%.
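
Each of those 50 UDP receiver threads looks roughly like this (a sketch; ClientConnection and enqueue() are stand-ins for my plain-socket client wrapper):

// one receiver thread per UDP port; each connected client additionally
// has a writer thread that drains the queue fed by enqueue()
Runnable udpReceiver = () -> {
  try (DatagramSocket socket = new DatagramSocket(port)) {
    byte[] buf = new byte[1500];
    while (true) {
      DatagramPacket packet = new DatagramPacket(buf, buf.length);
      socket.receive(packet); // blocks; most of the thread's time is spent here
      byte[] copy = Arrays.copyOf(buf, packet.getLength());
      synchronized (clients) {
        for (ClientConnection client : clients) {
          client.enqueue(copy); // handed off to that client's writer thread
        }
      }
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
};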

With an OIO Netty server I get around 6000 kB/s at the client side, with a lot of packet loss. And that is with the low water mark set to 50 MB and the high one to 100 MB! The CPU load is 80%, which isn't a good sign either.

With a NIO Netty server I get around 4500 kB/s, but with no losses, for some inexplicable reason. Maybe it was slowing down my sender process? But that makes no sense: the CPU load was around 60%, and NIO isn't supposed to use many threads that could hinder the sender's scheduling...

Here's my Netty server implementation:

public class NettyServer {

  public static void main(String[] args) throws Exception {
    new NettyServer(Integer.parseInt(args[0])).run();
  }
  private final int serverPort;

  private NettyServer(int serverPort) {
    this.serverPort = serverPort;
  }

  private void run() throws InterruptedException {
    boolean nio = false;
    EventLoopGroup bossGroup;
    EventLoopGroup workerGroup;
    EventLoopGroup receiverGroup;
    if (nio) {
      bossGroup = new NioEventLoopGroup();
      workerGroup = new NioEventLoopGroup();
      receiverGroup = new NioEventLoopGroup();
    } else {
      bossGroup = new OioEventLoopGroup();
      workerGroup = new OioEventLoopGroup();
      receiverGroup = new OioEventLoopGroup();
    }
    final List<ClientHandler> clients
            = Collections.synchronizedList(new LinkedList<ClientHandler>());
    ServerBootstrap server = new ServerBootstrap();
    server.group(bossGroup, workerGroup).channel(
            nio ? NioServerSocketChannel.class : OioServerSocketChannel.class)
            .childHandler(new ChannelInitializer<SocketChannel>() {
      @Override
      protected void initChannel(SocketChannel ch) throws Exception {
        ch.config().setWriteBufferHighWaterMark(1024 * 1024 * 100);
        ch.config().setWriteBufferLowWaterMark(1024 * 1024 * 50);
        final ClientHandler client = new ClientHandler(clients);
        ch.pipeline().addLast(client);
      }
    });
    server.bind(serverPort).sync();
    Bootstrap receiver = new Bootstrap();
    receiver.group(receiverGroup);
    receiver.channel(nio ? NioDatagramChannel.class : OioDatagramChannel.class);
    for (int port = 18000; port < 18000 + 50; ++port) {
      receiver.handler(new UDPHandler(clients));
      receiver.bind(port).sync();
    }
  }
}

class UDPHandler extends SimpleChannelInboundHandler<DatagramPacket> {
  private final Collection<ClientHandler> clients;
  private static final long start = System.currentTimeMillis();
  private static long sum = 0;
  private static long count = 0;
  private final Long[][] lastNum = new Long[50][4];

  public UDPHandler(Collection<ClientHandler> clients){
    this.clients = clients;
  }

  @Override
  protected void channelRead0(ChannelHandlerContext ctx, DatagramPacket msg) throws Exception {
    final ByteBuf content = msg.content();
    final int length = content.readableBytes();
    synchronized (UDPHandler.class) {
      sum += length;
      if (++count % 10000 == 0) {
        final long now = System.currentTimeMillis();
        System.err.println((sum / (now - start)) + " kB/s");
      }
    }
    long num = content.getLong(8);
    // this basically identifies the sender port
    // (0-49 represents ports 18000-18049)
    int nip = content.getByte(16) & 0xFF;
    // and this is "substream" within one port (0-3)
    int stream = content.getByte(17) & 0xFF;
    // the last received number for this nip/stream combo
    Long last = lastNum[nip][stream];
    if (last != null && num - last != 1) {
      // number isn't incremented by 1, so there's packet loss
      System.err.println("lost " + (num - last - 1));
    }
    lastNum[nip][stream] = num;
    synchronized (clients) {
      for (ClientHandler client : clients) {
        final ByteBuf copy = content.copy();
        client.send(copy);
      }
    }
  }

}

public class ClientHandler extends ChannelInboundHandlerAdapter {

  private final static Logger logger
          = Logger.getLogger(ClientHandler.class.getName());

  private final Collection<ClientHandler> clients;
  private Channel channel;

  ClientHandler(Collection<ClientHandler> clients) {
    this.clients = clients;
  }

  @Override
  public void handlerAdded(ChannelHandlerContext ctx) throws Exception {
    channel = ctx.channel();
    clients.add(this);
  }

  @Override
  public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
    clients.remove(this);
  }

  @Override
  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    if (!(cause instanceof IOException)) {
      logger.log(Level.SEVERE, "A terrible thing", cause);
    }
  }

  void send(ByteBuf msg) {
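    // if the client's TCP link can't keep up, drop the datagram rather
    // than queue it without bound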
    if (channel.isWritable()) {
      channel.writeAndFlush(msg);
    } else {
      msg.release();
    }
  }

}

Profiling shows that my trivial server implementation spends around 83% of its time in blocking UDP reads, 12% waiting for locks (if that's what sun.misc.Unsafe.park() indicates) and around 4.5% in blocking TCP writes.

The OIO server spends around 75% in blocking UDP reads, 11% in blocking TCP reads (why?), 6% in my UDP handler (why that much?) and 4% in blocking TCP writes.

The NIO server spends 97.5% of its time in selection, which should be a good sign. The lack of losses is a good sign too, and with the CPU load being the same as my trivial server's, everything would seem fine, if only the throughput weren't almost 2 times lower!

So here are my questions:

  1. Is Netty going to be effective for a task like this, or is it only good for a large number of connections/requests?
  2. Why does the OIO implementation eat so much CPU and lose packets? What is so different from plain old two-threads-per-client? I doubt it's only the overhead of utility data structures such as pipelines.
  3. What on Earth happens when I switch to NIO? How is it possible to slow down yet not lose any packets? I would definitely have thought there was something wrong with my code, but it seems to handle all 8000 kB/s of traffic if I simply switch to OIO without modifying anything. So is there a bug in my code that only manifests with NIO?

Comments:

OIO? LOL. It's a funny name. (ZhongYu)

Interesting indeed. Filed an issue for a follow-up: github.com/netty/netty/issues/1820 (trustin)

1 Answer

For network-heavy tasks, network bandwidth is usually the issue. On a 100 Mb/s TCP connection you can get up to about 11 MB/s at full utilisation, but you will get better results at less than 50% utilisation, i.e. about 5 MB/s or less. UDP is very sensitive to the buffers in your routers and network adapters. Unless you have specialist hardware, you can expect drop-outs at more than about 30% utilisation. Ideally you would have a dedicated network for UDP to avoid overflowing those buffers.
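
To spell out the arithmetic (treating protocol overhead as roughly 10%, a typical figure for Ethernet/IP/TCP framing):

  100 Mb/s / 8 = 12.5 MB/s of raw bandwidth
  12.5 MB/s less ~10% header/framing overhead ≈ 11 MB/s of usable TCP payload
  50% utilisation of that ≈ 5-6 MB/s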

In short, your numbers are realistic for a 100 Mb/s network. If you have a 1+ Gb/s network all the way through, with decent routers, I would expect much more.