As is known, SO_REUSEPORT allows multiple sockets to listen on the same IP address and port combination. It increases requests per second by 2 to 3 times, and reduces both latency (~30%) and the standard deviation of latency (8 times): https://www.nginx.com/blog/socket-sharding-nginx-release-1-9-1/
NGINX release 1.9.1 introduces a new feature that enables use of the SO_REUSEPORT socket option, which is available in newer versions of many operating systems, including DragonFly BSD and Linux (kernel version 3.9 and later). This socket option allows multiple sockets to listen on the same IP address and port combination. The kernel then load balances incoming connections across the sockets. ...
As shown in the figure, reuseport increases requests per second by 2 to 3 times, and reduces both latency and the standard deviation for latency.
SO_REUSEPORT is available on most modern OSes: Linux (kernel >= 3.9, since 29 Apr 2013), FreeBSD/OpenBSD/NetBSD, macOS, iOS/watchOS/tvOS, IBM AIX 7.2, Oracle Solaris 11.1, Windows (which has only SO_REUSEADDR, behaving like the two BSD flags SO_REUSEPORT + SO_REUSEADDR together), and possibly Android: https://stackoverflow.com/a/14388707/1558037
Linux >= 3.9
- Additionally the kernel performs some "special magic" for SO_REUSEPORT sockets that isn't found in other operating systems: For UDP sockets, it tries to distribute datagrams evenly, for TCP listening sockets, it tries to distribute incoming connect requests (those accepted by calling accept()) evenly across all the sockets that share the same address and port combination. Thus an application can easily open the same port in multiple child processes and then use SO_REUSEPORT to get a very inexpensive load balancing.
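For illustration, here is a minimal sketch (our own, not from the linked answer) of creating such a listener in C on Linux >= 3.9; the helper name make_reuseport_listener is hypothetical:

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP listening socket with SO_REUSEPORT set, so that several
 * sockets (one per thread or process) can bind the same ip:port and let
 * the kernel load-balance incoming connections between them. */
static int make_reuseport_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    /* Must be set BEFORE bind(); every socket sharing the port needs it,
     * otherwise bind() fails with EADDRINUSE. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```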
It is also known that, to avoid spin-lock contention and achieve high performance, no socket should be read by more than one thread; i.e., each thread should process its own sockets for read/write.
accept() is a thread-safe function for the same socket descriptor, so inside the kernel it has to be guarded by a lock, and that lock contention reduces performance: http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/2007-06/msg00246.html
POSIX.1-2001/SUSv3 requires accept(), bind(), connect(), listen(), socket(), send(), recv(), etc. to be thread-safe functions. It's possible that there are some ambiguities in the standard regarding their interaction with threads, but the intention is that their behaviour in multithreaded programs is governed by the standard.
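As a sketch of what this looks like in code (our own illustration, not from the linked thread): several threads all call accept() on one shared descriptor; each call is safe, but the kernel serializes the threads on the listening socket's internal lock:

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* Thread routine for the "one shared acceptor socket" pattern: every
 * worker accepts on the same listen_fd. */
static void *shared_acceptor(void *arg)
{
    int listen_fd = *(int *)arg;
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL); /* thread-safe, but contended */
        if (conn < 0)
            continue;
        /* ... recv()/send() on conn in this same thread ... */
        close(conn);
    }
    return NULL;
}
```

Usage would be to pthread_create() N such threads, all receiving the address of the same listening descriptor.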
- If we use one and the same socket from many threads, performance will be low because the socket is protected by a lock for thread-safe access from many threads: https://blog.cloudflare.com/how-to-receive-a-million-packets/
The receiving performance is down compared to a single threaded program. That's caused by a lock contention on the UDP receive buffer side. Since both threads are using the same socket descriptor, they spend a disproportionate amount of time fighting for a lock around the UDP receive buffer. This paper describes the problem in more detail.
- More details about the spin-lock taken when the application tries to read data from the socket are in "Analysis of Linux UDP Sockets Concurrent Performance": http://www.jcc2014.ucm.cl/jornadas/WORKSHOP/WSDP%202014/WSDP-4.pdf
V. KERNEL ISOLATION
...
From the other side, when the application tries to read data from the socket, it executes a similar process, which is described below and represented in Figure 3 from right to left:
1) Dequeue one or more packets from the receive queue, using the corresponding spinlock (green one).
2) Copy the information to user-space memory.
3) Release the memory used by the packet. This potentially changes the state of the socket, so two ways of locking the socket can occur: fast and slow. In both cases, the packet is unlinked from the socket, Memory Accounting statistics are updated, and the socket is released according to the locking path taken.
I.e., when many threads access the same socket, performance degrades because they all wait on a single spin-lock.
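One way to avoid this contention, consistent with the SO_REUSEPORT discussion above, is to give every thread its own socket bound to the same ip:port. A hedged sketch under that assumption (names are ours, error handling omitted for brevity):

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Each worker thread opens its OWN UDP socket bound with SO_REUSEPORT to
 * the same ip:port, so every thread dequeues from its own receive queue
 * under its own spinlock instead of fighting over one shared socket. */
static void *udp_worker(void *arg)
{
    uint16_t port = *(uint16_t *)arg;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* Only this thread ever reads this socket: no lock contention. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0)
            break;
        /* ... process n bytes ... */
    }
    close(fd);
    return NULL;
}
```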
We have a server with two Xeon CPUs (32 HT cores each, 64 HT cores in total), two 10 Gbit Ethernet cards, and Linux (kernel 3.9).
We use RFS and XPS, i.e., for a given connection the TCP/IP stack is processed (kernel space) on the same CPU core as the application's thread (user space).
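For this to work, each application thread has to be pinned to one core; a minimal Linux-specific sketch (RFS/XPS themselves are configured separately via sysfs, which is not shown here):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU core; combined with RFS/XPS this keeps
 * kernel-side processing of this thread's connections on the same core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```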
There are at least 3 ways to accept connections and process them in many threads:
- Use one acceptor socket shared between many threads, where each thread accepts connections and processes them
- Use one acceptor socket in a single thread, which pushes the accepted connections' socket descriptors to worker threads via a thread-safe queue
- Use many acceptor sockets that listen on the same ip:port, one individual acceptor socket in each thread, where the thread that accepts a connection also processes it (recv/send) - see the sketch after this list
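A sketch of the third variant (our own illustration, reusing the hypothetical make_reuseport_listener() and pin_to_core() helpers from the sketches above; port 8080 and 4 threads are arbitrary example values):

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Variant 3: one private SO_REUSEPORT acceptor socket per thread, all on
 * the same ip:port; the accepting thread also serves the connection, so
 * no descriptor is ever shared between threads. */
static void *worker(void *arg)
{
    int core = (int)(intptr_t)arg;
    pin_to_core(core);                        /* align with RFS/XPS */
    int lfd = make_reuseport_listener(8080);  /* this thread's own listener */
    if (lfd < 0)
        return NULL;
    for (;;) {
        int conn = accept(lfd, NULL, NULL);   /* no sharing, no lock contention */
        if (conn < 0)
            continue;
        /* ... recv()/send() entirely within this thread ... */
        close(conn);
    }
}

int main(void)
{
    enum { NTHREADS = 4 };                    /* e.g. one thread per core */
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```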
Which of these ways is the most efficient if we accept a lot of new TCP connections?