3
votes

I want to get the cache hit rate for a specific function (foo) of a C/C++ program running on a Linux machine. I am using gcc with no compiler optimization. With perf I can get hit rates for the entire program using the following command.

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out

But I am interested in the kernel foo only.

Is there a way to get hit rates only for foo using perf or any other tool?

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>


#define NI 192
#define NJ NI

#ifndef DATA_TYPE
    #define DATA_TYPE float
#endif


static 
void* xmalloc(size_t num)
{
    void * nnew = NULL;
    int ret = posix_memalign (&nnew, 32, num);
    if(!nnew || ret)
    {
        fprintf(stderr, "Cannot allocate memory\n");
        exit(1);
    }
    return nnew;
}

void* alloc_data(unsigned long long int n, int elt_size)
{
    size_t val = n;
    val *= elt_size;
    void* ret = xmalloc(val);
    return ret;
}


/* Array initialization. */
static
void init_array(int ni, int nj,
        DATA_TYPE A[NI][NJ],
        DATA_TYPE R[NJ][NJ],
        DATA_TYPE Q[NI][NJ])
{
  int i, j;

  for (i = 0; i < ni; i++)
    for (j = 0; j < nj; j++) {
      A[i][j] = ((DATA_TYPE) i*j) / ni;
      Q[i][j] = ((DATA_TYPE) i*(j+1)) / nj;
    }
  for (i = 0; i < nj; i++)
    for (j = 0; j < nj; j++)
      R[i][j] = ((DATA_TYPE) i*(j+2)) / nj;
}


/* Main computational kernel.*/

static
void foo(int ni, int nj,
        DATA_TYPE A[NI][NJ],
        DATA_TYPE R[NJ][NJ],
        DATA_TYPE Q[NI][NJ])
{
  int i, j, k;

  DATA_TYPE nrm;
  for (k = 0; k < nj; k++)
  {
    nrm = 0;
    for (i = 0; i < ni; i++)
      nrm += A[i][k] * A[i][k];
    R[k][k] = sqrt(nrm);
    for (i = 0; i < ni; i++)
      Q[i][k] = A[i][k] / R[k][k];
    for (j = k + 1; j < nj; j++)
    {
      R[k][j] = 0;
      for (i = 0; i < ni; i++)
        R[k][j] += Q[i][k] * A[i][j];
      for (i = 0; i < ni; i++)
        A[i][j] = A[i][j] - Q[i][k] * R[k][j];
    }
  }
}


int main(int argc, char** argv)
{
  /* Retrieve problem size. */
  int ni = NI;
  int nj = NJ;

  /* Variable declaration/allocation. */
  DATA_TYPE (*A)[NI][NJ];
  DATA_TYPE (*R)[NI][NJ];
  DATA_TYPE (*Q)[NI][NJ];

  A = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  R = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  Q = ((DATA_TYPE (*)[NI][NJ])(alloc_data((NI*NJ), (sizeof(DATA_TYPE)))));
  
/* Initialize array(s). */
  init_array (ni, nj,
          (*A),
          (*R),
          (*Q));


  /* Run kernel. */
  foo (ni, nj, *A, *R, *Q);

  /* Be clean. */
  free((void *)A);
  free((void *)R);
  free((void *)Q);

  return 0;
}

Output of lscpu command is:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15 
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel 
CPU family:            6
Model:                 63
Model name:            Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
Stepping:              2
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-15
Write a program which only runs foo() and measure that? – hyde
What you want is a caliper measurement: a "start counter" before calling foo() and a "stop counter" at the end of foo(). To do this, you will need to instrument the code and rebuild it. The ability to read the counters depends on the processor architecture and its PMU, and the way to read them is vendor-specific. That is why libraries like PAPI are useful, as they support multiple processor/PMU architectures transparently. Why were you not able to use PAPI? – Rachid K.
@hyde: That would include counts for dynamic linking and for the alloc/initialize part. You can count only user-space events by using perf stat --all-user (or, with older perf, event:u,event:u,...). So yes, you could just time the whole program if you can repeat foo enough times to drown out the background noise of the init work, provided it can be run multiple times without redoing its init. But that may be impractical if you want to run foo with a large array that involves a lot of init time. – Peter Cordes
@PeterCordes Could use static linking. Could precalculate the array. – hyde
But this returns error code -8 (Event exists, but cannot be counted due to counter resource limitations) when I try to add those events using the PAPI_add_event function. It fails when I try to add three events; if I add only two events, it works fine. – Atanu Barai

3 Answers

2
votes

You could also use LIKWID and its Marker API. It makes it very easy to instrument specific regions of your code. You can use the predefined performance group ICACHE on the Haswell architecture for the L1 instruction cache miss rate, or define your own performance group for the L1 data cache hit rate.

#include <likwid.h>

LIKWID_MARKER_INIT;
LIKWID_MARKER_START("foo");

foo();

LIKWID_MARKER_STOP("foo");
LIKWID_MARKER_CLOSE;

run application with:

likwid-perfctr -C 0 -g ICACHE -m <your application>

Make sure to compile with -DLIKWID_PERFMON, add the LIKWID include and library paths, and link the LIKWID library: -L$LIKWID_LIB -I$LIKWID_INCLUDE -llikwid. Everything is documented very well on their GitHub wiki.

1
votes

You might be interested in gprof(1). It won't measure cache hit rate, though (and a per-function cache hit rate is ill-defined anyway, since some calls to foo could be inlined once GCC is invoked with optimizations enabled).

You could use libbacktrace in your code. See also time(7) and signal(7).

You might compile your code with gcc -Wall -Wextra -O2 -g -pg, then use libbacktrace inside it (as GCC or RefPerSys do), and later use gprof(1) with gdb(1).

With some effort (so read Advanced Linux Programming, then syscalls(2) and signal-safety(7)), you might use setitimer(2) with sigaction(2) and/or profil(3).

Consider also generating some C code (e.g. using GPP and/or GNU bison in your own C code generator) and see this answer. J. Pitrat's book Artificial Beings: the Conscience of a Conscious Machine (ISBN-13: 978-1848211018) could be inspirational. You may want to generate some C code for extra instrumentation.

You might generate some code in a plugin (e.g. with libgccjit or GNU lightning...) at runtime, then dlopen(3) and dlsym(3) it. Read more about partial evaluation, see my manydl.c example, and, more seriously, the source code of OCaml or of SBCL.

You could write your own GCC plugin to automatically generate such measurements, in a more clever way than the -pg option of GCC does. Your plugin would transform (at the GIMPLE level) most function calls into something more complex that does some benchmarking (this is how -pg works inside GCC; you might study the source code of GCC). Try compiling your foo.c with gcc -Wall -Wextra -O2 -pg -S -fverbose-asm foo.c and look into the generated foo.s, perhaps adding more optimization, static analysis, or instrumentation options.

You could be interested in recent papers of ACM SIGPLAN.

Finally, benchmarking a C program compiled without optimizations makes little sense. Consider instead compiling and linking your program with at least gcc -flto -O2 -Wall.

Within your foo, you might use clock_gettime(2) cleverly to measure CPU time.

And if performance is very important and if you are allowed to spend weeks of work to improve it, you might consider using OpenCL (or perhaps CUDA) to compute your kernel on a powerful GPGPU. Of course, you need dedicated hardware. Otherwise, consider using OpenMP or OpenACC (or perhaps MPI). Some recent GCC compilers (at least GCC 10 in October 2020) could support these. Of course, read documentation on Invoking GCC.

1
votes

First, note that L1-dcache-store-misses is not supported on your processor. perf stat will tell you that in the output.

perf stat doesn't let you profile only selected regions of code. To do that, you have to manually instrument the code so that the relevant events are enabled and disabled around the regions of interest.

It's not possible to count the events L1-dcache-loads, L1-dcache-load-misses, and L1-dcache-stores without multiplexing on your processor (Haswell). They're mapped to the native events MEM_UOPS_RETIRED.ALL_LOADS, L1D.REPLACEMENT, and MEM_UOPS_RETIRED.ALL_STORES, respectively. Each of these events can only be counted on the first four general-purpose counters. In addition, there is a bug that is not documented in the specification update document of the i7-5960X, but does exist in that processor (it's documented in the spec update documents of other Haswell processors and of processors of some other microarchitectures). This bug is handled differently in different versions of perf: starting with kernel version 4.1-rc7, if one of the affected events is enabled on a logical core and hyperthreading is enabled at boot time, that logical core can only use up to two of its four general-purpose counters. The MEM_UOPS_RETIRED.* events are among those affected by the bug. One thing you can do is disable hyperthreading.

It's important to understand what kind of "cache hit rate" can be measured with these events; you probably don't want to measure something that doesn't make sense. One ratio that may make sense is L1-dcache-load-misses / (L1-dcache-loads + L1-dcache-stores), which represents the number of L1D replacements (lines filled into the cache that cause others to be evicted) for any reason, divided by the number of retired load and store uops. Not all misses cause replacements, and a significant portion of all misses may hit in the LFBs, which also don't cause replacements. Also, not all replacements are caused by accesses from uops that end up retiring.