Cache Performance (concerning loops) in C

Question

I was wondering, why does one set of loops allow for better cache performance than another in spite of logically doing the same thing?

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++) {
            accum += b[j][k] * a[k][i];
        }
        c[j][i] = accum;
    }
}

for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++) {
            c[j][i] += val * a[k][i];
        }
    }
}

I believe the first one above delivers better cache performance, but why?

Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?

When you say "I believe the first one delivers better cache performance but why??", have you profiled the code? — Iharob Al Asimi
Why do you believe the first delivers better cache performance if you have not tested and you have no reason why it might do so? Might it be that your homework problem asserts so? We are sometimes willing to help with homework, but I, at least, dislike HW questions in disguise. — John Bollinger
If this is an abstract, academic question, it probably belongs in cs.stackexchange.com, not SO. — Barmar
@user3528438 I understand, however the practice problem states "assume associativity and cache size is constant", I guess I forgot to include that in the post. — user5411184

Unknown Unknown · Accepted Answer · 2015-12-16T09:12:53

Just generally speaking, the most efficient loops through a matrix are going to cycle through the last dimension, not the first ("last" being c in m[a][b][c]).

For example, given a 2D matrix like an image which has its pixels represented in memory from top-left to bottom-right, the quickest way to sequentially iterate through it is going to be horizontally across each scanline, like so:

for (int y=0; y < h; ++y) {
    for (int x=0; x < w; ++x)
        // access pixel[y][x]
}

... not like this:

for (int x=0; x < w; ++x) {
    for (int y=0; y < h; ++y)
        // access pixel[y][x]
}

... due to spatial locality. It's because the computer grabs memory from slower, bigger regions of the hierarchy and moves it to faster, small regions in large, aligned chunks (ex: 64 byte cache lines, 4 kilobyte pages, and down to a little teeny 64-bit general-purpose register, e.g.). The first example accesses all the data from such a contiguous chunk immediately and prior to eviction.

harold on this site gave me a nice view on how to look at and explain this subject by suggesting not to focus so much on cache misses, but instead focusing on striving to use all the data in a cache prior to eviction. The second example fails to do that for all but the most trivially-small images by iterating through the image vertically with a large, scanline-sized stride rather than horizontally with a small, pixel-sized one.

Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?

The answer here would be "yes", as an increase in block size would naturally equate to more compulsory misses (that would be more simply "misses" though rather than "miss rate") but also just more data to process which won't all necessarily fit into the fastest L1 cache. If we're accessing a large amount of data with a large stride, we end up getting a higher non-compulsory miss rate as a result of more data being evicted from the cache before we utilize it, only to then redundantly load it back into a faster cache.

There is also a case where, if the block size is small enough and aligned properly, all the data will just fit into a single cache line and it wouldn't matter so much how we sequentially access it.

Matrix Multiplication

Now your example is quite a bit more complex than this straightforward image example above, but the same concepts tend to apply.

Let's look at the first one:

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++)
            accum += b[j][k] * a[k][i];
        c[j][i] = accum;
    }
}

If we look at the innermost k loop, we access b[j][k]. That's a fairly optimal access pattern: "horizontal" if we imagine a row-order memory layout. However, we also access a[k][i]. That's not so optimal, especially for a very large matrix, as it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.

If we look at the second j loop, that's accessing c[j][i], again in a vertical fashion which is not so optimal.

Now let's have a glance at the second example:

for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++)
            c[j][i] += val * a[k][i];
    }
}

If we look at the second k loop in this case, it's starting off accessing b[j][k] which is optimal (horizontal). Furthermore, it's explicitly memoizing the value to val, which might improve the odds of the compiler moving that to a register and keeping it there for the following loop (this relates to compiler concepts related to aliasing, however, rather than CPU cache).

In the innermost i loop, we're accessing c[j][i] which is also optimal (horizontal) along with a[k][i] which is also optimal (horizontal).

So this second version is likely to be more efficient in practice. Note that we can't absolutely say that, as aggressive optimizing compilers can do all sorts of magical things like rearranging and unrolling loops for you. Yet short of that, we should be able to say the second one has higher odds of being more efficient.

"What's a profiler?"

I just noticed this question in the comments. A profiler is a measuring tool that can give you a precise breakdown of where time is spent in your code, along with possibly further statistics like cache misses and branch mispredictions.

It's not only good for optimizing real-world production code and helping you more effectively prioritize your efforts to places that really matter, but it can also accelerate the learning process of understanding why inefficiencies exist through the process of chasing one hotspot after another.

Loop Tiling/Blocking

It's worth mentioning an advanced optimization technique which can be useful for large matrices -- loop tiling/blocking. It's beyond the scope of this subject but that one plays to temporal locality.

Deep C Optimization

Hopefully later you will be able to C these things clearly as a deep C explorer. While most optimization is best saved for hindsight with a profiler in hand, it's useful to know the basics of how the memory hierarchy works as you go deeper and deeper exploring the C.