DPDK 17.11.1 - drops seen when doing destination based rate limiting

Question

Editing the problem statement to highlight more on the core logic

We are seeing performance issues when doing destination based rate limiting. We maintain state for every {destination-src} pair (max of 100 destinations and 2^16 sources). We have an array of 100 nodes and at each node we have a rte_hash*. This hash table is going to maintain the state of every source ip seen by that destination. We have a mapping for every destination seen (0 to 100) and this is used to index into the array. If a particular source exceeds a threshold defined for this destination in a second, we block the source, else we allow the source. At runtime, when we see only traffic for 2 or 3 destinations, there are no issues, but when we go beyond 5, we are seeing lot of drops. Our function has to do a lookup and identify the flow matching the dest_ip and src_ip. Process the flow and decide whether it needs dropping. If the flow is not found, add it to the hash.

struct flow_state {
    struct rte_hash* hash;    
};

struct flow_state flow_state_arr[100];

// am going to create these hash tables using rte_hash_create at pipeline_init and free them during pipeline_free.

Am outlining what we do in pseudocode.

run()
{
    1) do rx
    2) from the pkt, get index into the flow_state_arr and retrieve the rte_hash* handle    
    3) rte_hash_lookup_data(hash, src_ip,flow_data)
    4) if entry found, take decision on the flow (the decision is simply say rate limiting the flow)
    5) else rte_hash_add_data(hash,src_ip,new_flow_data) to add the flow to table and forward
}

Please guide if we can have these multiple hash table objects in data path or whats the best way if we need to handle states for every destination separately.

Edit
Thanks for answering. I will be glad to share the code snippets and our gathered results. I don't have comparison results for other DPDK versions, but below are some of the results for our tests using 17.11.1.

Test Setup
Am using IXIA traffic gen (using two 10G links to generate 12Mpps) for 3 destinations 14.143.156.x (in this case - 101,102,103). Each destination's traffic comes from 2^16 different sources. This is the traffic gen setup.

Code Snippet

    struct flow_state_t {
        struct rte_hash* hash;
        uint32_t size;
        uint64_t threshold;
    };
    struct flow_data_t {
        uint8_t curr_state; // 0 if blocked, 1 if allowed
        uint64_t pps_count;
        uint64_t src_first_seen;
    };
    struct pipeline_ratelimit {
        struct pipeline p;
        struct pipeline_ratelimit_params params;
        rte_table_hash_op_hash f_hash;
        uint32_t swap_field0_offset[SWAP_DIM];
        uint32_t swap_field1_offset[SWAP_DIM];
        uint64_t swap_field_mask[SWAP_DIM];
        uint32_t swap_n_fields;
        pipeline_msg_req_handler custom_handlers[2]; // handlers for add and del
        struct flow_state_t flow_state_arr[100];
        struct flow_data_t flows[100][65536];
    } __rte_cache_aligned;
    
    /*
      add_handler(pipeline,msg) -- msg includes index and threshold
      In the add handler
      a rule/ threshold is added for a destination
      rte_hash_create and store rte_hash* in flow_state_arr[index]
      max of 100 destinations or rules are allowed
      previous pipelines add the ID (index) to the packet to look in to the
      flow_state_arr for the rule
    */
    
    /*
      del_handler(pipeline,msg) -- msg includes index
      In the del handler
      a rule/ threshold @index is deleted
      the associated rte_hash* is also freed
      the slot is made free
    */
    
    #define ALLOWED 1
    #define BLOCKED 0
    #define TABLE_MAX_CAPACITY 65536
    int do_rate_limit(struct pipeline_ratelimit* ps, uint32_t id, unsigned char* pkt)
    {
        uint64_t curr_time_stamp = rte_get_timer_cycles();
        struct iphdr* iph = (struct iphdr*)pkt;
        uint32_t src_ip = rte_be_to_cpu_32(iph->saddr);
    
        struct flow_state_t* node = &ps->flow_state_arr[id];
        struct flow_data_t* flow = NULL
        rte_hash_lookup_data(node->hash, &src_ip, (void**)&flow);
        if (flow != NULL)
        {
            if (flow->curr_state == ALLOWED)
            {
                if (flow->pps_count++ > node->threshold)
                {
                    uint64_t seconds_elapsed = (curr_time_stamp - flow->src_first_seen) / CYCLES_IN_1SEC;
                    if (seconds_elapsed)
                    {
                        flow->src_first_seen += seconds_elapsed * CYCLES_IN_1_SEC;
                        flow->pps_count = 1;
                        return ALLOWED;
                    }
                    else
                    {
                        flow->pps_count = 0;
                        flow->curr_state = BLOCKED;
                        return BLOCKED;
                    }
                }
                return ALLOWED;
            }
            else
            {
                uint64_t seconds_elapsed = (curr_time_stamp - flow->src_first_seen) / CYCLES_IN_1SEC;
                if (seconds_elapsed > 120)
                {
                    flow->curr_state = ALLOWED;
                    flow->pps_count = 0;
                    flow->src_first_seen += seconds_elapsed * CYCLES_IN_1_SEC;
                    return ALLOWED;
                }
                return BLOCKED;
            }
        }
        int index = node->size;
        // If entry not found and we have reached capacity
        // Remove the rear element and mark it as the index for the new node    
        if (node->size == TABLE_MAX_CAPACITY)
        {
            rte_hash_reset(node->hash);
            index = node->size = 0;
        }
    
        // Add new element @packet_flows[mit_id][index]
        struct flow_data_t* flow_data = &ps->flows[id][index]; 
        *flow_data = { ALLOWED, 1, curr_time_stamp };
        node->size++;
    
        // Add the new key to hash
        rte_hash_add_key_data(node->hash, (void*)&src_ip, (void*)flow_data);    
        return ALLOWED;
    }
    static int pipeline_ratelimit_run(void* pipeline)
    {
        struct pipeline_ratelimit* ps = (struct pipeline_ratelimit*)pipeline;
    
        struct rte_port_in* port_in = p->port_in_next;
        struct rte_port_out* port_out = &p->ports_out[0];
        struct rte_port_out* port_drop = &p->ports_out[2];
    
        uint8_t valid_pkt_cnt = 0, invalid_pkt_cnt = 0;
        struct rte_mbuf* valid_pkts[RTE_PORT_IN_BURST_SIZE_MAX];
        struct rte_mbuf* invalid_pkts[RTE_PORT_IN_BURST_SIZE_MAX];
    
        memset(valid_pkts, 0, sizeof(valid_pkts));
        memset(invalid_pkts, 0, sizeof(invalid_pkts));
    
        uint64_t n_pkts;
    
        if (unlikely(port_in == NULL)) {
            return 0;
        }
    
        /* Input port RX */
        n_pkts = port_in->ops.f_rx(port_in->h_port, p->pkts,
            port_in->burst_size);
    
        if (n_pkts == 0)
        {
            p->port_in_next = port_in->next;
            return 0;
        }
    
        uint32_t rc = 0;
        char* rx_pkt = NULL;
    
        for (j = 0; j < n_pkts; j++) {
    
            struct rte_mbuf* m = p->pkts[j];
            rx_pkt = rte_pktmbuf_mtod(m, char*);
            uint32_t id = rte_be_to_cpu_32(*(uint32_t*)(rx_pkt - sizeof(uint32_t)));
            unsigned short packet_len = rte_be_to_cpu_16(*((unsigned short*)(rx_pkt + 16)));
    
            struct flow_state_t* node = &(ps->flow_state_arr[id]);
    
            if (node->hash && node->threshold != 0)
            {
                // Decide whether to allow of drop the packet
                // returns allow - 1, drop - 0
                if (do_rate_limit(ps, id, (unsigned char*)(rx_pkt + 14)))
                    valid_pkts[valid_pkt_count++] = m;
                else
                    invalid_pkts[invalid_pkt_count++] = m;
            }
            else
                valid_pkts[valid_pkt_count++] = m;
    
            if (invalid_pkt_cnt) {
                p->pkts_mask = 0;
                rte_memcpy(p->pkts, invalid_pkts, sizeof(invalid_pkts));
                p->pkts_mask = RTE_LEN2MASK(invalid_pkt_cnt, uint64_t);
                rte_pipeline_action_handler_port_bulk_mod(p, p->pkts_mask, port_drop);
            }
    
            p->pkts_mask = 0;
            memset(p->pkts, 0, sizeof(p->pkts));
    
            if (valid_pkt_cnt != 0)
            {
                rte_memcpy(p->pkts, valid_pkts, sizeof(valid_pkts));
                p->pkts_mask = RTE_LEN2MASK(valid_pkt_cnt, uint64_t);
            }
    
            rte_pipeline_action_handler_port_bulk_mod(p, p->pkts_mask, port_out);
    
            /* Pick candidate for next port IN to serve */
            p->port_in_next = port_in->next;
            return (int)n_pkts;
        }
}

RESULTS

When generated traffic for only one destination from 60000 sources with threshold of 14Mpps, there were no drops. We were able to send 12Mpps from IXIA and recv 12Mpps
Drops were observed after adding 3 or more destinations (each configured to recv traffic from 60000 sources). The throughput was only 8-9 Mpps. When sent for 100 destinations (60000 src each), only 6.4Mpps were handled. 50% drop was seen.
On running it through vtune-profiler, it reported rte_hash_lookup_data as the hotspot and mostly memory bound (DRAM bound). I will attach the vtune report soon.

there is lack of clarity in your question. You mention there is performance degradation with rte_hash in DPDK 17.11.1. But I am not able to see any test (performance) results run for DPDK 17.11.1 vs 17.11.10 vs 19.11.3. I am not able to find systematic isolation of rx/field extract/hash extract areas for your run. You have also not shared code snippet (can be done via pastebin) to suggest if it cacheline or algo issue. Happy to help if there are sufficient data, — Vipin Varghese
@VipinVarghese I have edited with code samples and some results. Also, mentioning here, the issue is not seen when the number of sources maintained per destination is fewer (in the order of 100s or 1000s). When we bump this number to 2^16 sources per destination, we see the issue. — Srivatsan Vijayaraghavan
ok thanks for the edit and the results, if you still believe rte_hash is causing the problem one needs to isolate the same by trying the hash logic in DPDK example/skeleton with the same hash logic. If you are able to reproduce the error, then it is rte_hash, If not then it processing logic error. — Vipin Varghese
@VipinVarghese any suggestions on if we are using the rte_hash library correctly. When there is only traffic for one destination, lookups are not slowed down even when we have 2^16 sources, but when we have multiple traffic contexts (the number of rte_hash contexts is linear with the number of destinations) and that is when we start to see the drop. Like you suggested, rte_hash is able to hash 2^16 sources, but how to handle multiple contexts is the challenge. — Srivatsan Vijayaraghavan
@Srivastsan as I have suggested if you feel rte_hash is limiting your performance and not the other code, the only way to isolate is to prototype on top of example/skeleton with your current hash logic. I am still not clear what is error, would be happy to hear you out in skype or meeting — Vipin Varghese

Vipin Varghese Vipin Varghese · Accepted Answer · 2020-09-09T02:08:28

Based on the update from internal testing, rte_hash library is not causing performance drops. Hence as suggested in comment is more likely due to current pattern and algorithm design which might be leading cache misses and lesser Instruction per Cycle.

To identify whether it is frontend stall or backend pipeline stall or memory stall please either use perf or vtune. Also try to minimize branching and use more likely and prefetch too.

DPDK 17.11.1 - drops seen when doing destination based rate limiting

1 Answers