3
votes

I want to get the cache hit rate for a specific function (foo) of a C/C++ program running on a Linux machine. I am using gcc with no compiler optimization. With perf I can get hit rates for the entire program using the following command.

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out

But I am interested in the kernel foo only.

Is there a way to get hit rates only for foo using perf or any other tool?

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>


#define NI 192
#define NJ NI

#ifndef DATA_TYPE
    #define DATA_TYPE float
#endif


static 
void* xmalloc(size_t num)
{
    void * nnew = NULL;
    int ret = posix_memalign (&nnew, 32, num);
    if(!nnew || ret)
    {
        fprintf(stderr, "Cannot allocate memory\n");
        exit(1);
    }
    return nnew;
}

void* alloc_data(unsigned long long int n, int elt_size)
{
    size_t val = n;
    val *= elt_size;
    void* ret = xmalloc(val);
    return ret;
}


/* Array initialization. */
static
void init_array(int ni, int nj,
        DATA_TYPE A[NI][NJ],
        DATA_TYPE R[NJ][NJ],
        DATA_TYPE Q[NI][NJ])
{
  int i, j;

  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      A[i][j] = ((DATA_TYPE) i*j) / ni;
      Q[i][j] = ((DATA_TYPE) i*(j+1)) / nj;
    }
  for (i = 0; i < nj; i++)
    for (j = 0; j < nj; j++)
      R[i][j] = ((DATA_TYPE) i*(j+2)) / nj;
}


/* Main computational kernel.*/

static
void foo(int ni, int nj,
        DATA_TYPE A[NI][NJ],
        DATA_TYPE R[NJ][NJ],
        DATA_TYPE Q[NI][NJ])
{
  int i, j, k;

  DATA_TYPE nrm;
  for (k = 0; k < nj; k++)
  {
    nrm = 0;
    for (i = 0; i < ni; i++)
      nrm += A[i][k] * A[i][k];
    R[k][k] = sqrt(nrm);
    for (i = 0; i < ni; i++)
      Q[i][k] = A[i][k] / R[k][k];
    for (j = k + 1; j < nj; j++)
    {
      R[k][j] = 0;
      for (i = 0; i < ni; i++)
        R[k][j] += Q[i][k] * A[i][j];
      for (i = 0; i < ni; i++)
        A[i][j] = A[i][j] - Q[i][k] * R[k][j];
    }
  }
}


int main(int argc, char** argv)
{
  /* Retrieve problem size. */
  int ni = NI;
  int nj = NJ;

  /* Variable declaration/allocation. */
  DATA_TYPE (*A)[NI][NJ];
  DATA_TYPE (*R)[NI][NJ];
  DATA_TYPE (*Q)[NI][NJ];

  A = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  R = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  Q = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  
/* Initialize array(s). */
  init_array (ni, nj,
          (*A),
          (*R),
          (*Q));


  /* Run kernel. */
  foo (ni, nj, *A, *R, *Q);

  /* Be clean. */
  free((void *)A);
  free((void *)R);
  free((void *)Q);

  return 0;
}

Output of lscpu command is:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15 
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel 
CPU family:            6
Model:                 63
Model name:            Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
Stepping:              2
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-15
Write a program which only runs foo() and measure that? – hyde
What you want is a caliper measurement: a "start counter" before calling foo() and a "stop counter" at the end of foo(). To do this, you will need to instrument the code and rebuild it. The ability to read the counters depends on the processor architecture and its PMU, and the way to read them is vendor-specific. That is why libraries like PAPI are useful, as they support multiple processor/PMU architectures transparently. Why were you not able to use PAPI? – Rachid K.
@hyde: That would include counts for dynamic linking and for the alloc/initialize part. You can count only user-space events by using perf stat --all-user (or, with older perf, event:u,event:u,...). So yes, you could just time the whole program if you can repeat foo enough times to drown out the background noise of the init work, provided it can be run multiple times without redoing its init. But that may be impractical if you want to run foo with a large array that involves a lot of init time. – Peter Cordes
@PeterCordes Could use static linking. Could precalculate the array. – hyde
But this returns error code -8 (Event exists, but cannot be counted due to counter resource limitations) when I try to add those events using the PAPI_add_event function. It fails when I try to add three events; if I add only two events, it works fine. – Atanu Barai

3 Answers

2
votes

You could also use LIKWID and its Marker API. It makes it very easy to instrument specific regions of your code. You can use the predefined performance group ICACHE on the Haswell architecture for the L1 instruction cache miss rate, or define your own performance group for the L1 data cache hit rate.

#include <likwid.h>

LIKWID_MARKER_INIT;
LIKWID_MARKER_START("foo");

foo();

LIKWID_MARKER_STOP("foo");
LIKWID_MARKER_CLOSE;

run application with:

likwid-perfctr -C 0 -g ICACHE -m <your application>

Make sure to compile with -DLIKWID_PERFMON, add the LIKWID include and library paths, and link the LIKWID library: -L$LIKWID_LIB -I$LIKWID_INCLUDE -llikwid. Everything is documented very well on their GitHub wiki.

1
votes

You might be interested in gprof(1). It won't measure cache hit rate, though (and a per-function cache hit rate is ill-defined anyway, since some calls to foo could be inlined once GCC is invoked with optimizations enabled).

You could use libbacktrace in your code. See also time(7) and signal(7).

You might compile your code with gcc -Wall -Wextra -O2 -g -pg, then use libbacktrace inside it (as GCC or RefPerSys do), and later use gprof(1) with gdb(1).

With some effort (so read Advanced Linux Programming, then syscalls(2) and signal-safety(7)), you might use setitimer(2) with sigaction(2) and/or profil(3).

Consider also generating some C code (e.g. using GPP and/or GNU bison in your own C code generator) and see this answer. J. Pitrat's book Artificial Beings: the Conscience of a Conscious Machine (ISBN-13: 978-1848211018) could be inspirational. You may want to generate some C code for extra instrumentation.

You might generate some code in a plugin (e.g. with libgccjit or GNU lightning...) at runtime, then dlopen(3) and dlsym(3) it. Read more about partial evaluation, see my manydl.c example, and, more seriously, the source code of OCaml or of SBCL.

You could write your own GCC plugin to automatically generate such measurements, in a more clever way than the -pg option of GCC does. Your plugin would transform (at the GIMPLE level) most function calls into something more complex that does some benchmarking (this is how -pg works inside GCC; you might study the source code of GCC). Try compiling your foo.c with gcc -Wall -Wextra -O2 -pg -S -fverbose-asm foo.c and look into the generated foo.s, perhaps adding more optimization, static analysis, or instrumentation options.

You could be interested in recent papers of ACM SIGPLAN.

Finally, benchmarking a C program compiled without optimizations makes little sense. Consider instead compiling and linking your program with at least gcc -flto -O2 -Wall.

Within your foo, you might use clock_gettime(2) cleverly to measure CPU time.

And if performance is very important and if you are allowed to spend weeks of work to improve it, you might consider using OpenCL (or perhaps CUDA) to compute your kernel on a powerful GPGPU. Of course, you need dedicated hardware. Otherwise, consider using OpenMP or OpenACC (or perhaps MPI). Some recent GCC compilers (at least GCC 10 in October 2020) could support these. Of course, read documentation on Invoking GCC.

1
votes

First, note that L1-dcache-store-misses is not supported on your processor. perf stat will tell you that in the output.

perf stat doesn't let you profile only selected regions of code. To do that, you have to manually instrument the code so that the relevant events are enabled and disabled around the regions of interest.

It's not possible to count the events L1-dcache-loads, L1-dcache-load-misses, and L1-dcache-stores without multiplexing on your processor (Haswell). They're mapped to the native events MEM_UOPS_RETIRED.ALL_LOADS, L1D.REPLACEMENT, and MEM_UOPS_RETIRED.ALL_STORES, respectively. Each of these events can only be counted on the first four general-purpose counters. In addition, there is a bug that is not documented in the specification update document of the i7-5960X, but does exist in that processor (it's documented in the spec update documents of other Haswell processors and of processors of some other microarchitectures). This bug is handled differently in different versions of perf: starting with kernel version 4.1-rc7, if one of the affected events is enabled on a logical core and hyperthreading is enabled at boot time, that logical core can only use up to two of its four general-purpose counters. The MEM_UOPS_RETIRED.* events are among those affected by the bug. One thing you can do is disable hyperthreading.

It's important to understand what kind of "cache hit rate" can be measured with these events; you probably don't want to measure something that doesn't make sense. One ratio that may make sense is L1-dcache-load-misses / (L1-dcache-loads + L1-dcache-stores), which represents the number of L1D replacements (lines filled into the cache that cause others to be evicted) for any reason, divided by the number of retired load and store uops. Not all misses cause replacements, and a significant portion of all misses may hit in the LFBs, which also don't cause replacements. Also, not all replacements are caused by accesses from uops that end up retiring.