
I have developed code that takes as input a large 2-D image (up to 64 Mpixels) and:

  • Applies a filter to each row
  • Transposes the image (using blocking to avoid lots of cache misses; a sketch follows this list)
  • Applies a filter to the columns (now rows) of the image
  • Transposes the filtered image back to carry on with other calculations
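
To give an idea of the transpose step, a minimal sketch follows (this is an illustration, not my exact code; the tile size BLOCK, the float pixel type, and the function name transpose_blocked are assumptions made for this example):

    #include <stddef.h>

    #define BLOCK 64  /* tile edge, chosen so a src tile plus a dst tile fit in L1 */

    /* Blocked transpose: dst[j][i] = src[i][j].
     * Walking the image tile by tile keeps both the reads from src and the
     * writes to dst inside a small working set, which avoids the cache
     * misses a naive column-wise walk would cause. */
    static void transpose_blocked(const float *src, float *dst,
                                  size_t rows, size_t cols)
    {
        for (size_t ib = 0; ib < rows; ib += BLOCK) {
            for (size_t jb = 0; jb < cols; jb += BLOCK) {
                size_t imax = (ib + BLOCK < rows) ? ib + BLOCK : rows;
                size_t jmax = (jb + BLOCK < cols) ? jb + BLOCK : cols;
                for (size_t i = ib; i < imax; i++)
                    for (size_t j = jb; j < jmax; j++)
                        dst[j * rows + i] = src[i * cols + j];
            }
        }
    }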

Although it doesn't change anything, for the sake of completeness of my question: the filtering applies a discrete wavelet transform, and the code is written in C.
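
For illustration only, one level of the row filter could look like the sketch below (this assumes an unnormalized Haar wavelet, float pixels, an even row length, and the hypothetical name dwt_row; my actual filter and its vectorized code differ):

    #include <stddef.h>
    #include <string.h>

    /* One level of a Haar-style DWT on a single row of length n (n even):
     * the first half of the row receives the averages (low-pass) and the
     * second half the differences (high-pass). The access pattern is purely
     * sequential, which is why filtering rows is cache-friendly while
     * filtering columns in place is not - hence the transposes. */
    static void dwt_row(float *row, float *tmp, size_t n)
    {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++) {
            float a = row[2 * i];
            float b = row[2 * i + 1];
            tmp[i]        = 0.5f * (a + b);  /* approximation coefficients */
            tmp[half + i] = 0.5f * (a - b);  /* detail coefficients */
        }
        memcpy(row, tmp, n * sizeof *row);
    }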

My end goal is to make this run as fast as possible. The speedups I have achieved so far are more than 10x, through the use of the blocked matrix transpose, vectorization, multithreading, compiler-friendly code, etc.

Coming to my question: the latest profiling stats for the code (collected using perf stat -e) have troubled me.

        76,321,873 cache-references                                            
     8,647,026,694 cycles                    #    0.000 GHz                    
     7,050,257,995 instructions              #    0.82  insns per cycle        
        49,739,417 cache-misses              #   65.171 % of all cache refs    

       0.910437338 seconds time elapsed

The ratio (# of cache-misses)/(# of instructions) is low, at around 49,739,417 / 7,050,257,995 ≈ 0.7%. Here it is mentioned that this number is a good one to keep in mind when checking memory efficiency.

On the other hand, the ratio of cache-misses to cache-references is significantly high (65%!), which, as I see it, could indicate that something is going wrong with the execution in terms of cache efficiency.

The detailed stat from perf stat -d is:

   2711.191150 task-clock                #    2.978 CPUs utilized          
         1,421 context-switches          #    0.524 K/sec                  
            50 cpu-migrations            #    0.018 K/sec                  
       362,533 page-faults               #    0.134 M/sec                  
 8,518,897,738 cycles                    #    3.142 GHz                     [40.13%]
 6,089,067,266 stalled-cycles-frontend   #   71.48% frontend cycles idle    [39.76%]
 4,419,565,197 stalled-cycles-backend    #   51.88% backend  cycles idle    [39.37%]
 7,095,514,317 instructions              #    0.83  insns per cycle        
                                         #    0.86  stalled cycles per insn [49.66%]
   858,812,708 branches                  #  316.766 M/sec                   [49.77%]
     3,282,725 branch-misses             #    0.38% of all branches         [50.19%]
 1,899,797,603 L1-dcache-loads           #  700.724 M/sec                   [50.66%]
   153,927,756 L1-dcache-load-misses     #    8.10% of all L1-dcache hits   [50.94%]
    45,287,408 LLC-loads                 #   16.704 M/sec                   [40.70%]
    26,011,069 LLC-load-misses           #   57.44% of all LL-cache hits    [40.45%]

   0.910380914 seconds time elapsed

Here the frontend and backend stalled cycles are also high, and the last-level cache seems to suffer from a high miss rate of 57.4%.

Which metric is the most appropriate for this scenario? One idea I had is that the workload may no longer require further "touching" of the LL caches after the initial image load (it loads the values once and is done with them afterwards) - being an image filtering algorithm, the workload is more CPU-bound than memory-bound.

The machine I'm running this on is a Xeon E5-2680 (8 cores, 20 MB of Smart Cache, plus 256 KB of L2 cache per core).

koukouviou, perf stat -d may be inaccurate; it can be useful to rerun it several times with 4-5 events per run, e.g. perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses. If your stages can be separated, you may measure every stage to find which one generates the misses. Or you can run perf record -e cache-misses to get a profile of the code that has the most misses (as was recommended in the "Here" you linked). - osgx
@osgx I have done the tests you mention, but the post was already lengthy so I kept it as brief as I could. Although the stats change a bit, they remain statistically similar, with a high cache-miss percentage. I've run perf record/report/annotate and the misses are attributed largely to kernel-kallsyms, but I don't know what to make of that... - koukouviou
koukouviou, you may rerun your stat with all events limited to user space: perf stat -e event1:u,event2:u,... and compare with the posted results (by default both user and kernel are measured). kernel-kallsyms means that some function from kernel space had this event... Some possible issues with unknown/unresolved symbols are listed here: brendangregg.com/blog/2014-06-22/perf-cpu-sample.html. I can't give you advice about metrics... - osgx

1 Answer


The first thing you want to make sure of is that no other compute-intensive process is running on your machine. That's a server CPU, so I thought that could be a problem.

If you use multithreading in your program and you distribute an equal amount of work between the threads, you might be interested in collecting metrics on only one CPU.
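
For example (a rough sketch assuming Linux; the core index 0 and the helper name pin_to_core are just placeholders), you can pin a worker to a single core so that a per-CPU measurement such as perf stat -a -C 0 counts exactly that worker:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread/process to one core, so that counting events
     * on that core alone (e.g. perf stat -a -C 0) captures exactly this worker. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = current task */
    }

    int main(void)
    {
        if (pin_to_core(0) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run one thread's share of the filtering work here ... */
        return 0;
    }

Alternatively, taskset -c can do the pinning from the shell without code changes.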

I suggest disabling hyper-threading during the optimization phase, as it can lead to confusion when interpreting the profiling metrics (e.g. an increased number of cycles spent in the back-end). Also, if you distribute work to 3 threads, there is a high chance that 2 threads will share the resources of one core while the 3rd will have an entire core to itself - and it will be faster.

Perf has never been very good at explaining its metrics. Judging by the order of magnitude, the cache references are the L2 misses that hit the LLC. A high LLC miss count compared with the LLC references is not always a bad thing if the number of LLC references per instruction is low. In your case you have about 0.018, which means that most of your data is served from L2. The high LLC miss ratio means that you still need to get data from RAM and write it back.

Regarding the FE-bound and BE-bound cycle counts, I'm a bit concerned about the values because they don't add up to 100% of the total number of cycles. You have 8G cycles in total, yet 6G cycles stalled in the FE and 4G cycles stalled in the BE. That does not seem right.

If the FE stall count is high, that means you have misses in the instruction cache or bad branch speculation. If the BE stall count is high, that means you are waiting for data.

Anyway, regarding your question: the most relevant metric for assessing the performance of your code is instructions per cycle (IPC). Your CPU can execute up to 4 instructions per cycle, yet you only execute 0.8. That means resources are underutilized, except for the case where you have many vector instructions. After IPC you need to check branch misses and L1 misses (data and code), because those generate the most penalties.

A final suggestion: you may be interested in trying Intel's VTune Amplifier. It gives a much better explanation of the metrics and points you to the potential problems in your code.