
Hello Stack Overflow experts,

I am having trouble applying IP fragmentation across multiple cores.

My ultimate question is whether it is possible to have multiple IP fragmentation tables, each allocated with its own direct and indirect pool.

I would be really thankful if somebody could point out what I am doing wrong here or suggest an alternative solution.


Purpose

I am trying to apply IP fragmentation across multiple cores and maximize throughput with messages that are bigger than the MTU.

  • Each local and remote host uses 1 to 8 logical cores:
  • cores 1 ~ 4 for transferring fragmented messages,
  • cores 4 ~ 8 for receiving and reassembling messages.
  • The local host sends a 4 KB message.
  • The remote host echoes the message back to the local host.
  • The total throughput is calculated.


Problem

If I try to allocate a fragmentation table to each core, I get a segmentation fault, and this happens no matter how much I shrink the size of the fragmentation table. The way I have tried to allocate the pools and frag-tables is shown below.

for (coreid = 0; coreid < allocated_cores; coreid++) {
    fragmentation_table[coreid] = rte_ip_frag_table_create(...);
    direct_pool[coreid] = rte_pktmbuf_pool_create(...);
    indirect_pool[coreid] = rte_pktmbuf_pool_create(...);
}
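
Concretely, what I mean is something along these lines (a simplified sketch, not my exact code: NB_MBUF is a placeholder size, and the arrays and max_flow_* / MEMPOOL_CACHE_SIZE constants are the ones from the setup code further below):

#include <stdio.h>
#include <rte_ip_frag.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>
#include <rte_cycles.h>

/* Sketch: one frag table and one direct/indirect pool pair per lcore.
 * Pool names include the lcore id so they stay unique. */
static int
create_per_lcore_resources(unsigned allocated_cores)
{
    for (unsigned coreid = 0; coreid < allocated_cores; coreid++) {
        int socket = rte_lcore_to_socket_id(coreid);
        uint64_t frag_cycles = (rte_get_tsc_hz() + MS_PER_S - 1) / MS_PER_S * max_flow_ttl;
        char name[RTE_MEMPOOL_NAMESIZE];

        fragmentation_table[coreid] = rte_ip_frag_table_create(max_flow_num,
                IP_FRAG_TBL_BUCKET_ENTRIES, max_flow_num, frag_cycles, socket);

        snprintf(name, sizeof(name), "direct_pool_%u", coreid);   /* unique per lcore */
        direct_pool[coreid] = rte_pktmbuf_pool_create(name, NB_MBUF,
                MEMPOOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);

        snprintf(name, sizeof(name), "indirect_pool_%u", coreid); /* unique per lcore */
        indirect_pool[coreid] = rte_pktmbuf_pool_create(name, NB_MBUF,
                MEMPOOL_CACHE_SIZE, 0, 0 /* indirect mbufs carry no data room */, socket);

        if (fragmentation_table[coreid] == NULL ||
                direct_pool[coreid] == NULL || indirect_pool[coreid] == NULL) {
            printf("per-lcore allocation failed on lcore %u\n", coreid);
            return -1;
        }
    }
    return 0;
}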

So, alternatively, I allocated a separate fragmentation table for each lcore but let the direct and indirect pools be shared.

for (coreid = 0; coreid < allocated_cores; coreid++) {
    fragmentation_table[coreid] = rte_ip_frag_table_create(...);
}

direct_pool = rte_pktmbuf_pool_create(...);
indirect_pool = rte_pktmbuf_pool_create(...);


Situation

Now, when I send messages from the local to the remote host using multiple cores, the remote host only receives the messages successfully if I add a delay between sends (for example, a sleep(1) after each message). Without any delay, the remote host receives no data at all.


Conclusion

Personally, I suspect that I should allocate a direct pool and an indirect pool for each logical core, and that this is the main issue. Since I was only able to use the frag-table successfully with a single logical core, I suspect that I am not using the fragmentation table correctly across multiple cores.

I would really like to hear from DPDK experts about this issue, and I will be grateful for any advice.

static int
setup_queue_tbl(struct lcore_rx_queue *rxq, uint32_t lcore, uint32_t queue)
{
    int socket;
    uint32_t nb_mbuf;
    uint64_t frag_cycles;
    char buf[RTE_MEMPOOL_NAMESIZE];

    socket = rte_lcore_to_socket_id(lcore);
    if (socket == SOCKET_ID_ANY)
        socket = 0;

    /* Fragment lifetime in TSC cycles; max_flow_ttl is given in milliseconds. */
    frag_cycles = (rte_get_tsc_hz() + MS_PER_S - 1) / MS_PER_S * max_flow_ttl;

    /* One fragmentation table per lcore (the table is not thread safe). */
    if ((rxq->frag_tbl = rte_ip_frag_table_create(max_flow_num,
                                                  IP_FRAG_TBL_BUCKET_ENTRIES, max_flow_num, frag_cycles,
                                                  socket)) == NULL) {
        printf("rte_ip_frag_table_create failed!!!!\n");
        return -1;
    }

    /* Size the pools for the worst case: every flow fragmented into
     * MAX_FRAG_NUM pieces, plus the RX/TX descriptor rings. */
    nb_mbuf = RTE_MAX(max_flow_num, 2UL * MAX_PKT_BURST) * MAX_FRAG_NUM;
    nb_mbuf *= (port_conf.rxmode.max_rx_pkt_len + BUF_SIZE - 1) / BUF_SIZE;
    nb_mbuf *= 1;
    nb_mbuf += nb_rxd + nb_txd;

    if (transfer_pool[lcore] == NULL) {

        /* Note: the pools are stored per lcore but named per socket. */
        snprintf(buf, sizeof(buf), "pool_receive_%d", socket);
        receive_pool[lcore] = rte_pktmbuf_pool_create(buf, nb_mbuf, MEMPOOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);

        snprintf(buf, sizeof(buf), "pool_transfer_%d", socket);
        transfer_pool[lcore] = rte_pktmbuf_pool_create(buf, nb_mbuf, MEMPOOL_CACHE_SIZE, 0, PKT_SIZE + 128, socket);

        /* Direct pool provides mbufs for the new fragment headers; the
         * indirect pool provides zero-data-room mbufs that reference the
         * original payload. */
        snprintf(buf, sizeof(buf), "pool_direct_%d", socket);
        direct_pool[lcore] = rte_pktmbuf_pool_create(buf, nb_mbuf, MEMPOOL_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);

        snprintf(buf, sizeof(buf), "pool_indirect_%d", socket);
        indirect_pool[lcore] = rte_pktmbuf_pool_create(buf, nb_mbuf, MEMPOOL_CACHE_SIZE, 0, 0, socket);
    }

    /* Per-lcore software rings between the RX and TX stages. */
    snprintf(buf, sizeof(buf), "mbuf_rx_ring_%d", lcore);
    rx_ring[lcore] = rte_ring_create(buf, 512, socket, 0);
    snprintf(buf, sizeof(buf), "mbuf_tx_ring_%d", lcore);
    tx_ring[lcore] = rte_ring_create(buf, 512, socket, 0);

    /* Buffer for the reassembled packet handed to the application. */
    rxq->ar = (struct assembled_result *)malloc(sizeof(struct assembled_result));
    rxq->ar->length = 0;
    rxq->ar->assembled_pkt = (char *)malloc(sizeof(char) * PKT_SIZE);
    return 0;
}
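
For context, this is roughly how each receiving lcore is meant to drive its own table; a simplified sketch rather than the exact code from the repository (it assumes MAX_PKT_BURST and struct lcore_rx_queue from the code above, and uses the older DPDK type names ipv4_hdr / ether_hdr, which newer releases prefix with rte_):

#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_ip_frag.h>
#include <rte_ethdev.h>
#include <rte_cycles.h>

/* Sketch: receive one burst on this lcore's queue and reassemble fragments
 * using the lcore-private table created in setup_queue_tbl(). */
static void
rx_reassemble_once(struct lcore_rx_queue *rxq, uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[MAX_PKT_BURST];
    struct rte_ip_frag_death_row death_row = { .cnt = 0 };
    uint64_t tms = rte_rdtsc();
    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, MAX_PKT_BURST);

    for (uint16_t i = 0; i < nb_rx; i++) {
        struct rte_mbuf *m = pkts[i];
        struct ipv4_hdr *ip_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
                                                          sizeof(struct ether_hdr));

        if (rte_ipv4_frag_pkt_is_fragmented(ip_hdr)) {
            /* The library needs the L2/L3 lengths before reassembly. */
            m->l2_len = sizeof(struct ether_hdr);
            m->l3_len = sizeof(struct ipv4_hdr);
            m = rte_ipv4_frag_reassemble_packet(rxq->frag_tbl, &death_row,
                                                m, tms, ip_hdr);
            if (m == NULL)
                continue; /* not all fragments have arrived yet */
        }
        /* ... hand the (possibly reassembled) packet to the application ... */
    }
    /* Free mbufs of stale/duplicate fragments collected by the library. */
    rte_ip_frag_free_death_row(&death_row, 3 /* prefetch */);
}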

Here is the complete source code; the part that I want to verify is in dpdk_init.h:

https://github.com/SungHoHong2/DPDK-Experiment/blob/master/dpdk-server-multi/dpdk_init.h


2 Answers

Answer 1 (2 votes)

1. Please provide source code

It will help to get you answers, not guesses ;)

2. Fragment Table vs lcores

The DPDK Programmer's Guide clearly states:

all update/lookup operations on Fragment Table are not thread safe.

Source

So each lcore must have its own fragment table or locks must be used.
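
For example, if a single table really had to be shared (not recommended), every update/lookup would have to be serialized; a rough sketch of the locked variant, again using the older ipv4_hdr type name:

#include <rte_spinlock.h>
#include <rte_ip_frag.h>

/* Shared-table variant: serialize all table accesses with a spinlock.
 * A per-lcore table, as in your code, avoids this lock entirely. */
static rte_spinlock_t frag_tbl_lock = RTE_SPINLOCK_INITIALIZER;

static struct rte_mbuf *
reassemble_locked(struct rte_ip_frag_tbl *shared_tbl,
                  struct rte_ip_frag_death_row *dr,
                  struct rte_mbuf *m, uint64_t tms, struct ipv4_hdr *ip_hdr)
{
    struct rte_mbuf *out;

    rte_spinlock_lock(&frag_tbl_lock);
    out = rte_ipv4_frag_reassemble_packet(shared_tbl, dr, m, tms, ip_hdr);
    rte_spinlock_unlock(&frag_tbl_lock);
    return out;
}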

3. Memory Pools vs lcores

By default, memory pools in DPDK are thread safe, unless we pass a flag like MEMPOOL_F_SP_PUT. So, answering your question:

whether it is possible to have multiple ip fragmentation table allocated with each different direct and indirect pool.

By default, several lcores can share the same memory pools.
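
In other words, a single pair of direct/indirect pools created once can safely be used by all lcores; a rough sketch (sizes are placeholders):

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* One direct + one indirect pool shared by all lcores.
 * rte_pktmbuf_pool_create() builds a multi-producer/multi-consumer pool by
 * default, and the cache_size argument gives each lcore a private cache,
 * so no extra locking is needed on the fast path. */
struct rte_mempool *direct_pool, *indirect_pool;

static int
create_shared_pools(int socket_id)
{
    direct_pool = rte_pktmbuf_pool_create("direct_pool", 8192, 256, 0,
                                          RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
    indirect_pool = rte_pktmbuf_pool_create("indirect_pool", 8192, 256, 0,
                                            0 /* no data room needed */, socket_id);
    return (direct_pool != NULL && indirect_pool != NULL) ? 0 : -1;
}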

4. Guess!

As there is no source code, I can only guess that the root cause is that the TTL for the fragments is less than 1 second, so with sleep(1) the packets arrive too late to be reassembled.
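
That guess is easy to check against the table-creation parameters: rte_ip_frag_table_create() takes the fragment lifetime in TSC cycles. A sketch of the conversion, using the same formula as your setup code (max_flow_ttl assumed to be in milliseconds):

#include <rte_cycles.h>

/* max_cycles passed to rte_ip_frag_table_create() is the fragment lifetime
 * in TSC cycles. If max_flow_ttl is below 1000 ms, fragments delayed by
 * about a second (e.g. by sleep(1)) expire before they can be reassembled. */
static uint64_t
frag_ttl_cycles(uint32_t max_flow_ttl_ms)
{
    return (rte_get_tsc_hz() + MS_PER_S - 1) / MS_PER_S * max_flow_ttl_ms;
}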

5. Side note

Reassembly is a very time- and space-consuming operation and should be avoided at all costs.

Consider some ways to fit your message into one packet by changing the protocol or using jumbo frames.
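
For instance, with the older rte_eth_conf layout your code appears to use (it references rxmode.max_rx_pkt_len), enabling jumbo frames is roughly a port-configuration change; the exact field and flag names vary between DPDK versions:

#include <rte_ethdev.h>

/* Sketch: let the NIC carry the whole 4 KB message in one frame instead of
 * fragmenting it. Field/flag names differ across DPDK releases. */
static struct rte_eth_conf port_conf = {
    .rxmode = {
        .max_rx_pkt_len = 9000,                  /* accept jumbo frames        */
        .offloads = DEV_RX_OFFLOAD_JUMBO_FRAME,  /* flag name varies by version */
    },
};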

Answer 2 (1 vote)

The answer was a very basic one. The problem was in the configuration of the hugepages: the hugepage size was different on each of the clusters that I was testing.

The cluster where the frag-tables could be allocated successfully had ...

AnonHugePages:    208896 kB
HugePages_Total:       8
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB

While the other cluster, which kept returning segmentation faults when I attempted to allocate frag-tables, had ...

AnonHugePages:      6144 kB 
HugePages_Total:    1024    
HugePages_Free:        0    
HugePages_Rsvd:        0    
HugePages_Surp:        0    
Hugepagesize:       2048 kB 

Here is the basic performance of scaling the rx and tx queues per logical core. As you can see, it is possible to use multiple rx/tx queues, each associated with a frag-table per logical core and the required pools.

[Figure: throughput scaling with the number of rx/tx queues per logical core]