I am trying to maximize throughput on a single RX-Queue of my network card. My current setup utilizes the Shared Umem feature by setting up multiple sockets on the same RX-Queue, each with a reference to the same Umem.

My kernel XDP program then assigns streams of packets to the correct socket via a BPF_MAP_TYPE_XSKMAP. This all works fine, but at around 600.000 pps, ksoftirqd/18 reaches 100% CPU load (I moved my userspace application to another core via taskset -c 1 to reduce the load on core 18). My userspace app doesn't exceed 14% CPU load, so unfortunately the reason I am not able to process any more packets is the huge number of interrupts.
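
For reference, a stripped-down sketch of that XDP program (simplified: the map name xsks_map, its size, and the way I derive the socket index from the packet's stream are placeholders here):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);           /* one slot per AF_XDP socket */
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } xsks_map SEC(".maps");

    SEC("xdp")
    int xdp_sock_prog(struct xdp_md *ctx)
    {
        /* Placeholder: the real program derives the index from the
         * packet's headers (its stream), not from the queue id. */
        int index = ctx->rx_queue_index;

        /* Redirect to the AF_XDP socket stored at that index;
         * fall back to the normal stack if no socket is attached. */
        if (bpf_map_lookup_elem(&xsks_map, &index))
            return bpf_redirect_map(&xsks_map, index, 0);

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";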

I then read about the XDP bind flag XDP_USE_NEED_WAKEUP, which lets the Umem Fill-Ring go to sleep and thereby reduces interrupt overhead (as far as I understand it; there is not a lot of information out there on this topic). Because the Umem Fill-Ring might be sleeping, one has to regularly check:

    if (xsk_ring_prod__needs_wakeup(&umem->fq)) {
        /* kick the kernel so it continues processing the Fill-Ring */
        poll(fds, len, 10);
    }

fds is an array of struct pollfd, one entry with the file descriptor of each socket. I am not quite sure where to add the XDP_USE_NEED_WAKEUP flag, but here is how I use it:

static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem, struct config *cfg,
                                                    const bool rx, const bool tx) {
    struct xsk_socket_config xsk_socket_cfg;
    struct xsk_socket_info *xsk;
    struct xsk_ring_cons *rxr;
    struct xsk_ring_prod *txr;
    int ret;

    xsk = calloc(1, sizeof(*xsk));
    if (!xsk) {
        fprintf(stderr, "xsk `calloc` failed: %s\n", strerror(errno));
        exit(1);
    }

    xsk->umem = umem;
    xsk_socket_cfg.rx_size = XSK_CONS_AMOUNT;
    xsk_socket_cfg.tx_size = XSK_PROD_AMOUNT;
    if (cfg->ip_addrs_len > 1) {
        xsk_socket_cfg.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD;
    } else {
        xsk_socket_cfg.libbpf_flags = 0;
    }
    xsk_socket_cfg.xdp_flags = cfg->xdp_flags;
    xsk_socket_cfg.bind_flags = cfg->xsk_bind_flags | XDP_USE_NEED_WAKEUP;

    rxr = rx ? &xsk->rx : NULL;
    txr = tx ? &xsk->tx : NULL;
    ret = xsk_socket__create(&xsk->xsk, cfg->ifname_buf, cfg->xsk_if_queue, umem->umem, rxr, txr, &xsk_socket_cfg);
    if (ret) {
        /* xsk_socket__create() returns a negative errno code */
        fprintf(stderr, "`xsk_socket__create` returned error: %s\n", strerror(-ret));
        exit(-ret);
    }

    return xsk;
}
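
For context, the receive path per socket is essentially the usual peek/refill loop, roughly like this (a simplified sketch: the batch size is arbitrary, the frame handling is stubbed out, and the field names follow my xsk_socket_info / xsk_umem_info structs):

    #include <poll.h>
    #include <bpf/xsk.h>   /* xsk_* helpers from libbpf */

    static void rx_and_refill(struct xsk_socket_info *xsk)
    {
        struct pollfd pfd = { .fd = xsk_socket__fd(xsk->xsk), .events = POLLIN };
        __u32 idx_rx = 0, idx_fq = 0;

        /* Grab a batch of completed RX descriptors. */
        unsigned int rcvd = xsk_ring_cons__peek(&xsk->rx, 64, &idx_rx);
        if (!rcvd)
            return;

        /* Reserve the same number of slots in the (shared) Fill-Ring. */
        while (xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq) != rcvd) {
            if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
                poll(&pfd, 1, 0); /* kick the kernel, don't wait */
        }

        for (unsigned int i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);

            /* ... process the frame at desc->addr / desc->len ... */

            /* Hand the frame back to the kernel via the Fill-Ring. */
            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
        }

        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
        xsk_ring_cons__release(&xsk->rx, rcvd);
    }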

I observed that it had a small impact on the load of ksoftirqd/18, and I was able to process about 50.000 pps more than before (but this could also be because of changes to the general load of the system - I am not sure :/). I also noticed that XDP_USE_NEED_WAKEUP doesn't work for Shared Umem, because libbpf has this code in xsk.c:

sxdp.sxdp_family = PF_XDP;
sxdp.sxdp_ifindex = xsk->ifindex;
sxdp.sxdp_queue_id = xsk->queue_id;
if (umem->refcount > 1) {
    sxdp.sxdp_flags = XDP_SHARED_UMEM;
    sxdp.sxdp_shared_umem_fd = umem->fd;
} else {
    sxdp.sxdp_flags = xsk->config.bind_flags;
}
As you can see, the bind_flags are only used if the Umem has a refcount of 1 (it can't be less than that, because it is incremented earlier in xsk_socket__create). But because refcount is increased for every created socket, these bind_flags are only applied to the first socket (where refcount is still <= 1).
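
If I read that correctly, the practical consequence is that only the first call decides, e.g. (a sketch using my helper from above):

    /* refcount == 1 here: bind_flags (incl. XDP_USE_NEED_WAKEUP) reach bind() */
    struct xsk_socket_info *first  = xsk_configure_socket(umem, cfg, true, false);

    /* refcount > 1 now: bound with XDP_SHARED_UMEM, bind_flags are not passed */
    struct xsk_socket_info *second = xsk_configure_socket(umem, cfg, true, false);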

I don't quite understand why XDP_USE_NEED_WAKEUP can only be set for one socket. In fact, I don't understand why this flag is tied to the socket at all if it actually affects the Umem.

Nevertheless, I am searching for a way to reduce the interrupt overhead - any ideas how this could be achieved? I need to handle at least 1.000.000 pps.

1 Answer

That code in xsk.c simply ensures that either all sockets sharing the same UMEM use XDP_USE_NEED_WAKEUP, or none of them do, i.e. if you configure the first socket (the one not created as a shared socket) with XDP_USE_NEED_WAKEUP enabled, all of the shared sockets attached to the same UMEM afterwards will have this flag enabled as well, and vice versa. I'm not 100% sure why it was decided to do it this way, but XDP_USE_NEED_WAKEUP affects all the rings that userspace is the producer for, i.e. both the TX ring and the FILL ring. Since the FILL ring is tied to the UMEM and not to the socket, this flag affects the shared UMEM and can therefore not differ across sockets that share the same UMEM.
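
You can see that reflected in how the check works: the kernel sets a per-ring flag bit (XDP_RING_NEED_WAKEUP from <linux/if_xdp.h>) in each producer ring's flags field, and the libbpf helper is little more than a read of that bit - roughly (paraphrased, check the xsk.h of your libbpf version):

    static inline int xsk_ring_prod__needs_wakeup(const struct xsk_ring_prod *r)
    {
        /* The kernel raises this bit when it has stopped processing the ring
         * and needs an explicit wakeup (poll()/sendto()) from userspace. */
        return *r->flags & XDP_RING_NEED_WAKEUP;
    }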

This also answers your question about why the flag is tied to the socket: the TX ring also needs to be woken up by the user every time we write to it (if the flag is enabled and a wakeup is necessary). From the kernel's point of view, it just sees two rings for which unpredictable userspace is the producer, and it offers the option to be kind and not poll the hell out of those rings when it's not necessary. I don't see a reason why there couldn't be different flags for the TX ring and the FILL ring in the future, or different flags for the FILL ring based on the socket's queue id, but then again, I'm no kernel developer.
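
For the TX side that typically looks like this right after you've submitted descriptors (a sketch; xsk is your xsk_socket_info from the question, sendto() comes from <sys/socket.h>):

    /* Kick the kernel only when it asked for it; an empty sendto()
     * is the usual wakeup for the TX path. */
    if (xsk_ring_prod__needs_wakeup(&xsk->tx))
        sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);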

As for where to enable it, you're correct in assuming that it is part of the bind flags (https://www.kernel.org/doc/html/latest/networking/af_xdp.html#xdp-use-need-wakeup-bind-flag).

I also noticed you use a 10 ms timeout value in your poll() call. I'm not exactly sure about this, but in my testing it seems that it is not the result of the poll() call that matters - the very fact that it is called is notification enough for the kernel to accept that you've woken it up. Furthermore, I'm not sure whether a packet being received on an AF_XDP socket even constitutes an event that poll() can capture, so the kernel might really take you up on that 10 ms delay every time. I've made it a habit to set the timeout value to 0 for this reason: it is the kernel that needs to be woken up, not you that needs to be notified about anything.
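
Applied to your snippet, that would look like:

    if (xsk_ring_prod__needs_wakeup(&umem->fq)) {
        /* timeout 0: just kick the kernel, don't wait for any event */
        poll(fds, len, 0);
    }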

As for your question about improving the packet rate: in my testing, from about 800.000 pps on (64-byte packets, for reference) you can't really do any XDP benchmarking without enabling XDP_DRV mode, since that is the point where the allocation and deallocation of SKB buffers became the source of packet loss. After that, the bottleneck might be your userspace application or the number of RX queues you've got on your NIC - that's hard to tell without seeing more code.
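
Concretely, with the xsk_socket_config from your question that would be something along these lines (a sketch; zero-copy is optional and needs driver support - the bind can fail with EOPNOTSUPP, in which case you'd fall back to XDP_COPY):

    /* Native driver mode instead of generic (SKB) mode. */
    xsk_socket_cfg.xdp_flags = XDP_FLAGS_DRV_MODE;                  /* <linux/if_link.h> */

    /* Keep the wakeup flag and additionally try zero-copy. */
    xsk_socket_cfg.bind_flags = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY; /* <linux/if_xdp.h> */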