4 votes

I'm currently studying C optimizations and had a past assignment optimizing a piece of code. Among other optimizations (unrolling loops and strength reduction) I used blocking according to cache size (following Intel's tutorial on the matter):

https://software.intel.com/en-us/articles/how-to-use-loop-blocking-to-optimize-memory-use-on-32-bit-intel-architecture.

Now I think I understand why this technique works in that case, where the stride is one: it loads blocks into cache and reduces the number of misses when accessing the next place in memory. But in my code dst[dim * jj + ii] seems to jump around all over the place, since jj is multiplied by dim in the innermost loop. How does the cache account for this? dim is multiplied by 0, then 1, then 2, and so on; at some point the stride will surpass what the block can hold and the optimization will be pointless. Am I understanding this right?

In practice, however, when I used blocking only for the jj variable I didn't get the speedup I got from using blocking on both ii and jj. So I made the code faster, but I don't know why. The assignment is past now, but I still don't understand, and it's quite frustrating. Thank you in advance for bearing with what may be a very stupid question.

    #define block_size 16 /* not shown in the original post; assumed to be defined elsewhere */

    void transpose(int *dst, int *src, int dim)
    {
      int i, j, dimi, jj, ii;
      for (i = 0; i < dim; i += block_size)
      {
        for (j = 0; j < dim; j += block_size)
        {
          for (ii = i; ii < i + block_size; ii++)
          {
            dimi = dim * ii;
            for (jj = j; jj < j + block_size; jj++)
            {
              dst[dim*jj + ii] = src[dimi + jj];
            }
          }
        }
      }
    }
1 – What compiler with what optimization level, on what hardware? gcc -O3 will auto-vectorize, but IDK if it will do a good job here. Contiguous stores may be better than contiguous loads; then you could vectorize with vector load + shuffle, and vector store. A compiler could vectorize over the ii loop (2nd-from-inner) instead of the inner one, though, and produce 4 dst[] results in parallel. (Except gcc for x86 does a vector load and unpacks it into 4 stores. It could check for overlap between src and dst, but doesn't. godbolt.org/g/g24ehr IDK what clang is doing with a 256-byte copy.) – Peter Cordes

1 Answer

6 votes

You have poor spatial locality in dst, but with blocking for both dimensions there's still enough locality in time and space combined that cache lines are typically still hot in L1d cache when you store the next int.

Let's say that dst[dim*jj + ii] is the first int in a cache line. The store to dst[dim*jj + ii + 1] will be in the same cache line. If that line is still hot in L1d cache, the CPU hasn't spent any bandwidth on evicting the dirty line to L2 and then bringing it back into L1d for the next store.

With blocking for both dimensions, that next store happens after only block_size more stores, to dst[dim*(jj+1 .. j+block_size-1) + ii]. (It lands on the next iteration of the ii loop.)

If dim and block_size are both powers of 2, the line will probably be evicted because of conflicts. Addresses 4 kiB apart go to the same set in L1d, although the problematic stride is larger for L2. (Intel's L1d caches are 32 kiB and 8-way set associative, so as few as 8 more stores to the same set will probably evict a line. L3 caches use a hash function for set indexing, though, instead of a simple modulo that uses a range of address bits directly. IDK how big your buffers are; maybe your whole matrix can stay hot in your L3 cache.)

But if either dim or block_size aren't a power of 2, then all 64 sets of 8 lines of 64 bytes (L1d) come into play. So up to 64*8 = 512 dirty lines could be in L1d cache. But remember there's still the data being loaded sequentially, and that will take some space. (Not much, because you're reading 16 ints consecutively from each line of loaded data, and using that to dirty 16 different lines of destination data.)

With blocking only in 1 dimension, you're doing many more stores before you come back to a destination line, so it will probably have been evicted to L2 or maybe L3 by then.

BTW, I put your code on the Godbolt compiler explorer (https://godbolt.org/g/g24ehr), and gcc -O3 for x86 doesn't try to do anything special. It uses a vector load into an XMM register, and unpacks with shuffles and does 4 separate int stores.

clang6.0 does something interesting, involving copying a block of 256 bytes. IDK if it's doing this to work around aliasing (because without int *restrict dst it doesn't know that src and dst don't overlap).


BTW, contiguous writes and scattered reads would probably be better. (i.e. invert your inner two loops, so ii changes in the inner-most loop instead of jj). Evicting a dirty cache line is more expensive than evicting a clean line and just re-reading it again later.