Why does a function run faster when I call it continuously?

Question

I am developing a low-latency service in C++ on Linux. I ran two groups of performance tests:

Sending 1 request per second: the average latency is 3.5 microseconds.
Sending 10 requests per second: the average latency is 2.7 microseconds.

I cannot understand why. My guess is that a function runs faster when it is called more frequently, so I wrote a demo to test this.

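#include <stdio.h>
#include <stdlib.h>    // atoi
#include <time.h>      // clock_gettime, CLOCK_MONOTONIC
#include <unistd.h>    // sysconf
#include <pthread.h>   // pthread_self, pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <chrono>      // std::chrono::milliseconds
#include <thread>      // std::this_thread::sleep_for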

using namespace std;

long long get_curr_nsec()
{
    struct timespec now;
    ::clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec * 1000000000LL + now.tv_nsec;
}

long long func(int n)
{
    long long t1 = get_curr_nsec();
    int sum = 0;
    for(int i = 0; i < n ;i++)
    {
        // compiler barrier: keeps the loop from being removed entirely
        // (sum is a local that is never read, so the multiply itself may still be dropped)
        __asm__ __volatile__("": : :"memory");
        sum *= (sum+1);
    }

    return get_curr_nsec() - t1;
}

bool bind_cpu(int cpu_id, pthread_t tid)
{
    int cpu = (int)sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t cpu_info;
    
    if (cpu_id < 0 || cpu_id >= cpu)   // valid CPU ids are 0 .. cpu-1
    {
        printf("bind cpu failed: cpu_id[%d] out of range, cpu num[%d]\n", cpu_id, cpu);
        return false;
    }
    
    CPU_ZERO(&cpu_info);
    CPU_SET(cpu_id, &cpu_info);
    
    int ret = pthread_setaffinity_np(tid, sizeof(cpu_set_t), &cpu_info);
    if (ret)
    {
        printf("bind cpu failed, ret=%d\n", ret);
        return false;
    }
    
    return true;
}

int main(int argc, char **argv)
{
    if (argc < 3)
    {
        printf("usage: %s <times> <interval_ms>\n", argv[0]);
        return 1;
    }

    // make sure the program does not switch CPUs
    bind_cpu(3, ::pthread_self());

    // first argument: number of calls
    // second argument: interval between calls, in milliseconds
    int times = ::atoi(argv[1]);
    int interval = ::atoi(argv[2]);

    long long sum = 0;
    for(int i = 0; i < times; i++)
    {
        if(interval > 0)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(interval));
        }
        sum +=  func(100);
    }

    printf("avg elapse:%lld ns\n", sum/ times);
    return 0;
}

The compile command is g++ --std=c++11 ./main.cpp -O2 -lpthread, and I ran the following tests:

Call the function 100 times without sleeping, ./a.out 100 0, output: avg elapse:35 ns
Call the function 100 times with a 1 ms sleep, ./a.out 100 1, output: avg elapse:36 ns
Call the function 100 times with a 10 ms sleep, ./a.out 100 10, output: avg elapse:40 ns
Call the function 100 times with a 100 ms sleep, ./a.out 100 100, output: avg elapse:45 ns
Call the function 100 times with a 1000 ms sleep, ./a.out 100 1000, output: avg elapse:50 ns
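
Each figure above is an average over only 100 calls, so a handful of slow calls could move it noticeably. A minimal sketch of one way to look at the distribution instead, assuming the per-call results are collected into a vector; the names LatencySummary and summarize are hypothetical and not part of the demo:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not in the original demo): summary of per-call latencies.
struct LatencySummary {
    long long min_ns;
    long long median_ns;
    long long max_ns;
};

// Takes the samples by value so the caller's vector keeps its original order.
// Precondition: samples is non-empty.
LatencySummary summarize(std::vector<long long> samples)
{
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    // For an even count this picks the lower of the two middle elements.
    return LatencySummary{ samples.front(), samples[(n - 1) / 2], samples.back() };
}
```

In the demo's loop, each func(100) result would be pushed into a std::vector<long long> rather than only added to sum, and the summary printed after the loop. If the median stays flat while the maximum grows with the sleep interval, the shift in the average comes from a few outliers rather than from every call.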

My OS is CentOS Linux release 7.6.1810 (Core). My CPU is an Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz.

I am confused and do not know the cause. Is it the CPU, the OS, or the system call (sleep)?

Afterwards I used perf stat to count branches:

perf stat ./a.out 100 1, there are 241779 branches, 7091 branch-misses;
perf stat ./a.out 100 100, there are 241791 branches, 7636 branch-misses.

It seems that sleeping 100 ms causes more branch-misses. But I am still not certain this is the reason, and I don't know why sleeping 100 ms would cause more branch-misses.

I don't know what the deciding factor is in this case, but I am not surprised by the qualitative outcome. On the one hand, if you quickly repeat an action you benefit from caching of all kinds, while you don't if you do something rarely. On the other hand, I don't see any reason that doing something rarely should make it faster, except for cases where the CPU is put under so heavy load that throttling and similar effects become relevant. — user17732522
Thanks. I considered the CPU cache. But I do not know what the difference is between sleeping 1 ms and sleeping 10 ms. — Frank Liu
Other considerations potentially in play are branch prediction, speculative execution, and pipeline stalling (when a branch prediction is wrong and the speculatively executed instruction stream needs to be backed up). Doing something rarely makes it more difficult to accurately predict which path should be taken, and therefore increases the chances of stalling the instruction pipeline. The specifics depend on what strategies a particular CPU uses to speculatively predict which path to execute, how many instructions it executes before having to stall, etc. — Peter
@n. 1.8e9-where's-my-share m. Thanks. I executed cpupower frequency-info, and it shows: available cpufreq governors: performance powersave; current policy: frequency should be within 1.20 GHz and 5.00 GHz. The governor "performance" may decide which speed to use within this range. — Frank Liu
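
Since the governor may pick any speed within that range, it can help to sample the pinned core's clock right after a sleep and compare it with the back-to-back case. A minimal sketch, assuming the kernel exposes the cpufreq sysfs file scaling_cur_freq for the CPU (a single integer in kHz); the helper names parse_khz and read_cur_freq_khz are hypothetical:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: parse the content of a cpufreq sysfs file, which holds
// a single integer in kHz (for example "3000000\n"). Returns -1 on failure.
long parse_khz(const std::string& text)
{
    std::istringstream in(text);
    long khz = -1;
    if (!(in >> khz) || khz < 0)
        return -1;
    return khz;
}

// Reads the current frequency of the given CPU in kHz.
// Returns -1 if the sysfs file is missing or unreadable.
long read_cur_freq_khz(int cpu_id)
{
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu_id) +
                    "/cpufreq/scaling_cur_freq");
    if (!f)
        return -1;
    std::stringstream buf;
    buf << f.rdbuf();
    return parse_khz(buf.str());
}
```

Calling read_cur_freq_khz(3) just before func(100) in the demo, outside the timed region, would show whether the core drops to a lower clock after longer sleeps. The value is what the kernel reports at that moment, so it is only a rough indication.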