I have written code that takes a large 2-D image (up to 64 MPixels) as input and:
- Applies a filter on each row
- Transposes the image (using blocking to avoid lots of cache misses; see the sketch below)
- Applies a filter on the columns (now rows) of the image
- Transposes the filtered image back to carry on with other calculations
Although it doesn't change anything for the question itself, for the sake of completeness: the filtering applies a discrete wavelet transform and the code is written in C.
My end goal is to make this run as fast as possible. The speedups I have so far are more than 10x, through the use of the blocked matrix transpose, vectorization, multithreading, compiler-friendly code, etc.
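For reference, a minimal sketch of such a blocked transpose (the tile size and the float element type here are illustrative assumptions; the actual code is additionally vectorized and multithreaded):

    #include <stddef.h>

    /* 64x64 floats = 16 KB per tile -- an assumption; tune so that one
       source tile plus one destination tile fit in L1/L2. */
    #define BLOCK 64

    void transpose_blocked(const float *restrict src, float *restrict dst,
                           size_t rows, size_t cols)
    {
        for (size_t bi = 0; bi < rows; bi += BLOCK) {
            for (size_t bj = 0; bj < cols; bj += BLOCK) {
                size_t imax = (bi + BLOCK < rows) ? bi + BLOCK : rows;
                size_t jmax = (bj + BLOCK < cols) ? bj + BLOCK : cols;
                /* Both tiles stay cache-resident, so the strided writes
                   into dst hit cache lines that were just brought in. */
                for (size_t i = bi; i < imax; i++)
                    for (size_t j = bj; j < jmax; j++)
                        dst[j * rows + i] = src[i * cols + j];
            }
        }
    }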
Coming to my question: the latest profiling stats of the code (using perf stat -e) have troubled me.
76,321,873 cache-references
8,647,026,694 cycles # 0.000 GHz
7,050,257,995 instructions # 0.82 insns per cycle
49,739,417 cache-misses # 65.171 % of all cache refs
0.910437338 seconds time elapsed
The ratio (# of cache-misses)/(# of instructions) is low, at around 0.7%. Here it is mentioned that this number is a good one to keep in mind when checking for memory efficiency.
On the other hand, the ratio of cache-misses to cache-references is significantly high (65%!), which, as I see it, could indicate that something is going wrong with the execution in terms of cache efficiency.
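For reference, both numbers fall straight out of the first perf stat output:

    cache-misses / instructions      =  49,739,417 / 7,050,257,995  ≈  0.71%
    cache-misses / cache-references  =  49,739,417 /    76,321,873  ≈ 65.2%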
The detailed stats from perf stat -d are:
2711.191150 task-clock # 2.978 CPUs utilized
1,421 context-switches # 0.524 K/sec
50 cpu-migrations # 0.018 K/sec
362,533 page-faults # 0.134 M/sec
8,518,897,738 cycles # 3.142 GHz [40.13%]
6,089,067,266 stalled-cycles-frontend # 71.48% frontend cycles idle [39.76%]
4,419,565,197 stalled-cycles-backend # 51.88% backend cycles idle [39.37%]
7,095,514,317 instructions # 0.83 insns per cycle
# 0.86 stalled cycles per insn [49.66%]
858,812,708 branches # 316.766 M/sec [49.77%]
3,282,725 branch-misses # 0.38% of all branches [50.19%]
1,899,797,603 L1-dcache-loads # 700.724 M/sec [50.66%]
153,927,756 L1-dcache-load-misses # 8.10% of all L1-dcache hits [50.94%]
45,287,408 LLC-loads # 16.704 M/sec [40.70%]
26,011,069 LLC-load-misses # 57.44% of all LL-cache hits [40.45%]
0.910380914 seconds time elapsed
Here the frontend and backend stalled-cycle counts are also high, and the last-level cache seems to suffer from a high miss rate of 57.4%.
Which metric is the most appropriate for this scenario? One idea I have is that the workload no longer requires "touching" the last-level cache after the initial image load (it loads the values once and is then done with them); being an image-filtering algorithm, the workload would be more CPU-bound than memory-bound.
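One way to sanity-check that idea against the perf stat -d numbers: only a small fraction of the loads ever reaches the last-level cache, so the high LLC miss ratio applies to a small absolute number of accesses:

    LLC-loads / L1-dcache-loads     =  45,287,408 / 1,899,797,603  ≈ 2.4%
    LLC-load-misses / instructions  =  26,011,069 / 7,095,514,317  ≈ 0.37%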
The machine I'm running this on is a Xeon E5-2680 (8 cores, 20 MB of shared L3 Smart Cache, plus 256 KB of L2 cache per core).
perf stat -d may be inaccurate; it can be useful to rerun it several times with 4-5 events per run, e.g. perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses. If your stages can be separated, you can measure every stage to find which one generates the misses. Or you can run perf record -e cache-misses to get a profile of the code that has the most misses (as was recommended in the "Here" you linked). – osgx

Try perf stat -e event1:u,event2:u,... and compare with the posted results (by default both user and kernel are measured). kernel-kallsyms means that some function from kernel space had this event... Some possible issues with unknown/unresolved symbols are listed here: brendangregg.com/blog/2014-06-22/perf-cpu-sample.html. I can't give you advice about metrics... – osgx
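To illustrate the per-stage measurement osgx suggests: one way is to gate each stage behind a command-line switch, so perf stat can be pointed at a single stage per run. The --stage flag and the empty stage functions below are hypothetical placeholders, not the actual code:

    /* Hypothetical harness: profile one pipeline stage at a time, e.g.
     *   perf stat -e L1-dcache-loads,L1-dcache-load-misses ./dwt --stage=rows
     *   perf stat -e LLC-loads,LLC-load-misses ./dwt --stage=transpose
     */
    #include <string.h>

    static void filter_rows(void)    { /* row-wise DWT pass (placeholder) */ }
    static void transpose(void)      { /* blocked transpose (placeholder) */ }
    static void filter_columns(void) { /* column (now-row) DWT pass (placeholder) */ }

    int main(int argc, char **argv)
    {
        const char *stage = (argc > 1) ? argv[1] : "all";

        if (!strcmp(stage, "--stage=rows") || !strcmp(stage, "all"))
            filter_rows();
        if (!strcmp(stage, "--stage=transpose") || !strcmp(stage, "all"))
            transpose();
        if (!strcmp(stage, "--stage=cols") || !strcmp(stage, "all"))
            filter_columns();
        return 0;
    }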