As is known, SO_REUSEPORT allows multiple sockets to listen on the same IP address and port combination. It increases requests per second by 2 to 3 times, and reduces both latency (~30%) and the standard deviation of latency (8 times): https://www.nginx.com/blog/socket-sharding-nginx-release-1-9-1/
NGINX release 1.9.1 introduces a new feature that enables use of the SO_REUSEPORT socket option, which is available in newer versions of many operating systems, including DragonFly BSD and Linux (kernel version 3.9 and later). This socket option allows multiple sockets to listen on the same IP address and port combination. The kernel then load balances incoming connections across the sockets. ...
As shown in the figure, reuseport increases requests per second by 2 to 3 times, and reduces both latency and the standard deviation for latency.
SO_REUSEPORT is available on most modern OSes: Linux (kernel >= 3.9, since 29 Apr 2013), FreeBSD/OpenBSD/NetBSD, macOS, iOS/watchOS/tvOS, IBM AIX 7.2, Oracle Solaris 11.1, Windows (which has only SO_REUSEADDR, behaving like the two BSD flags SO_REUSEPORT + SO_REUSEADDR together), and possibly Android: https://stackoverflow.com/a/14388707/1558037
Linux >= 3.9
- Additionally the kernel performs some "special magic" for SO_REUSEPORT sockets that isn't found in other operating systems: For UDP sockets, it tries to distribute datagrams evenly, for TCP listening sockets, it tries to distribute incoming connect requests (those accepted by calling accept()) evenly across all the sockets that share the same address and port combination. Thus an application can easily open the same port in multiple child processes and then use SO_REUSEPORT to get a very inexpensive load balancing.
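For illustration, here is a minimal sketch (our own, not from the linked answer) of creating such a listener in C on Linux >= 3.9; the helper name make_reuseport_listener is hypothetical:

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP listening socket with SO_REUSEPORT set, so that several
 * sockets (one per thread or process) can bind the same ip:port and let
 * the kernel load-balance incoming connections between them. */
static int make_reuseport_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    /* Must be set BEFORE bind(); every socket sharing the port needs it,
     * otherwise bind() fails with EADDRINUSE. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```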
It is also known that, to avoid spin-lock contention and achieve high performance, no socket should be read by more than one thread; i.e., each thread should process its own sockets for read/write.
accept() is a thread-safe function for the same socket descriptor, so inside the kernel it has to be guarded by a lock, and that lock contention reduces performance: http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/2007-06/msg00246.html
POSIX.1-2001/SUSv3 requires accept(), bind(), connect(), listen(), socket(), send(), recv(), etc. to be thread-safe functions. It's possible that there are some ambiguities in the standard regarding their interaction with threads, but the intention is that their behaviour in multithreaded programs is governed by the standard.
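As a sketch of what this looks like in code (our own illustration, not from the linked thread): several threads all call accept() on one shared descriptor; each call is safe, but the kernel serializes the threads on the listening socket's internal lock:

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* Thread routine for the "one shared acceptor socket" pattern: every
 * worker accepts on the same listen_fd. */
static void *shared_acceptor(void *arg)
{
    int listen_fd = *(int *)arg;
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL); /* thread-safe, but contended */
        if (conn < 0)
            continue;
        /* ... recv()/send() on conn in this same thread ... */
        close(conn);
    }
    return NULL;
}
```

Usage would be to pthread_create() N such threads, all receiving the address of the same listening descriptor.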
- If we use one and the same socket from many threads, performance will be low because the socket is protected by a lock for thread-safe access from many threads: https://blog.cloudflare.com/how-to-receive-a-million-packets/
The receiving performance is down compared to a single threaded program. That's caused by a lock contention on the UDP receive buffer side. Since both threads are using the same socket descriptor, they spend a disproportionate amount of time fighting for a lock around the UDP receive buffer. This paper describes the problem in more detail.
- More details about the spin-lock taken when the application tries to read data from the socket are in "Analysis of Linux UDP Sockets Concurrent Performance": http://www.jcc2014.ucm.cl/jornadas/WORKSHOP/WSDP%202014/WSDP-4.pdf
V. KERNEL ISOLATION
...
From the other side, when the application tries to read data from the socket, it executes a similar process, which is described below and represented in Figure 3 from right to left:
1) Dequeue one or more packets from the receive queue, using the corresponding spinlock (green one).
2) Copy the information to user-space memory.
3) Release the memory used by the packet. This potentially changes the state of the socket, so two ways of locking the socket can occur: fast and slow. In both cases, the packet is unlinked from the socket, Memory Accounting statistics are updated, and the socket is released according to the locking path taken.
I.e., when many threads access the same socket, performance degrades because they all wait on a single spin-lock.
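One way to avoid this contention, consistent with the SO_REUSEPORT discussion above, is to give every thread its own socket bound to the same ip:port. A hedged sketch under that assumption (names are ours, error handling omitted for brevity):

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Each worker thread opens its OWN UDP socket bound with SO_REUSEPORT to
 * the same ip:port, so every thread dequeues from its own receive queue
 * under its own spinlock instead of fighting over one shared socket. */
static void *udp_worker(void *arg)
{
    uint16_t port = *(uint16_t *)arg;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    for (;;) {
        /* Only this thread ever reads this socket: no lock contention. */
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0)
            break;
        /* ... process n bytes ... */
    }
    close(fd);
    return NULL;
}
```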
We have a server with two Xeon CPUs (32 HT cores each, 64 HT cores in total), two 10 Gbit Ethernet cards, and Linux (kernel 3.9).
We use RFS and XPS, i.e., for a given connection the TCP/IP stack is processed (kernel space) on the same CPU core as the application's thread (user space).
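For this to work, each application thread has to be pinned to one core; a minimal Linux-specific sketch (RFS/XPS themselves are configured separately via sysfs, which is not shown here):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU core; combined with RFS/XPS this keeps
 * kernel-side processing of this thread's connections on the same core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```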
There are at least 3 ways to accept connections and process them in many threads:
- Use one acceptor socket shared between many threads, where each thread accepts connections and processes them
- Use one acceptor socket in a single thread, which pushes the accepted connections' socket descriptors to worker threads via a thread-safe queue
- Use many acceptor sockets that listen on the same ip:port, one individual acceptor socket in each thread, where the thread that accepts a connection also processes it (recv/send) - see the sketch after this list
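A sketch of the third variant (our own illustration, reusing the hypothetical make_reuseport_listener() and pin_to_core() helpers from the sketches above; port 8080 and 4 threads are arbitrary example values):

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Variant 3: one private SO_REUSEPORT acceptor socket per thread, all on
 * the same ip:port; the accepting thread also serves the connection, so
 * no descriptor is ever shared between threads. */
static void *worker(void *arg)
{
    int core = (int)(intptr_t)arg;
    pin_to_core(core);                        /* align with RFS/XPS */
    int lfd = make_reuseport_listener(8080);  /* this thread's own listener */
    if (lfd < 0)
        return NULL;
    for (;;) {
        int conn = accept(lfd, NULL, NULL);   /* no sharing, no lock contention */
        if (conn < 0)
            continue;
        /* ... recv()/send() entirely within this thread ... */
        close(conn);
    }
}

int main(void)
{
    enum { NTHREADS = 4 };                    /* e.g. one thread per core */
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```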
Which of these ways is the most efficient if we accept a lot of new TCP connections?