22
votes

On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:

On the CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i=x-1; i>=0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i=0; i< x; i++) {...}).

My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?

(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)

7
What is a "speculative loader" ? Google returns a handful of hits with this article at the top. I'm guessing it's one of those CPU technologies that does parts of future instructions in advance?Edmund

7 Answers

30
votes

You're really asking about prefetching, not about loop control logic.

In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.

Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.

You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.

Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:

  1. NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
  2. NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.

and two for Core:

  1. Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc.
  2. Core also has an ACL prefetcher like NetBurst.

If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do ok making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for for more sophisticated array traversals.

These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.

Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.

I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.

12
votes

I don't know. But I do know how to write a quick benchmark with no guarantees of scientific validity (actually, one with rather strict guarantees of invalidity). It has interesting results:

#include <time.h>
#include <stdio.h>

int main(void)
{
    int i;
    int s;
    clock_t start_time, end_time;
    int centiseconds;

    start_time = clock();
    s = 1;
    for (i = 0; i < 1000000000; i++)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Forward took %ld centiseconds\n", s, centiseconds);

    start_time = clock();
    s = 1;
    for (i = 999999999; i >= 0; i--)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Backward took %ld centiseconds\n", s, centiseconds);

    return 0;
}

Compiled with -O9 using gcc 3.4.4 on Cygwin, running on an "AMD Athlon(tm) 64 Processor 3500+" (2211 MHz) in 32 bit Windows XP:

Answer is -1243309311; Forward took 93 centiseconds
Answer is -1243309311; Backward took 92 centiseconds

(Answers varied by 1 either way in several repetitions.)

Compiled with -I9 using gcc 4.4.1 running on an "Intel(R) Atom(TM) CPU N270 @ 1.60GHz" (800 MHz and presumably only one core, given the program) in 32 bit Ubuntu Linux.

Answer is -1243309311; Forward took 196 centiseconds
Answer is -1243309311; Backward took 228 centiseconds

(Answers varied by 1 either way in several repetitions.)

Looking at the code, the forward loop is translated to:

; Gcc 3.4.4 on Cygwin for Athlon      ; Gcc 4.4.1 on Ubuntu for Atom
L5:                                .L2:
    addl    %eax, %ebx                 addl    %eax, %ebx
    incl    %eax                       addl    $1, %eax
    cmpl    $999999999, %eax           cmpl    $1000000000, %eax
    jle     L5                         jne     .L2

The backward to:

L9:                                .L3:
    addl    %eax, %ebx                 addl    %eax, %ebx
    decl    %eax                       subl    $1, $eax
    jns     L9                         cmpl    $-1, %eax
                                       jne .L3

Which shows, if not much else, that GCC's behaviour has changed between those two versions!

Pasting the older GCC's loops into the newer GCC's asm file gives results of:

Answer is -1243309311; Forward took 194 centiseconds
Answer is -1243309311; Backward took 133 centiseconds

Summary: on the >5 year old Athlon, the loops generated by GCC 3.4.4 are the same speed. On the newish (<1 year?) Atom, the backward loop is significantly faster. GCC 4.4.1 has a slight regression for this particular case which I personally am not bothered about in the least, given the point of it. (I had to make sure that s is used after the loop, because otherwise the compiler would elide the computation altogether.)

[1] I can never remember the command for system info...

7
votes

Yes. but with a caveat. The idea that looping backwards is faster never applied to all older CPUs. It's an x86 thing (as in 8086 through 486, possibly Pentium, although I don't think any further).

That optimization never applied to any other CPU architecture that I know of.

Here's why.

The 8086 had a register that was specifically optimized for use as a loop counter. You put your loop count in CX, and then there are several instructions that decrement CX and then set condition codes if it goes to zero. In fact there was an instruction prefix you could put before other instructions (the REP prefix) that would basically iterate the other instruction until CX got to 0.

Back in the days when we counted instructions and instructions had known fixed cycle counts using cx as your loop counter was the way to go, and cx was optimized for counting down.

But that was a long time ago. Ever since the Pentium, those complex instructions have been slower overall than using more, and simpler instructions. (RISC baby!) The key thing we try to do these days is try to put some time between loading a register and using it because the pipelines can actually do multiple things per cycle as long as you don't try to use the same register for more than one thing at a time.

Nowdays the thing that kills performance isn't the comparison, it's the branching, and then only when the branch prediction predicts wrong.

5
votes

I stumbled upon this question after observing a significant drop in performance when iterating over an array backwards vs forwards. I was afraid it would be the prefetcher, but the previous answers convinced me this was not the case. I then investigated further and found out that it looks like GCC (4.8.4) is unable to exploit the full power of SIMD operations in a backward loop.

In fact, compiling the following code (from here) with -S -O3 -mavx:

  for (i = 0; i < N; ++i)
    r[i] = (a[i] + b[i]) * c[i];

leads to essentially:

.L10:
    addl    $1, %edx
    vmovupd (%rdi,%rax), %xmm1
    vinsertf128     $0x1, 16(%rdi,%rax), %ymm1, %ymm1
    vmovupd (%rsi,%rax), %xmm0
    vinsertf128     $0x1, 16(%rsi,%rax), %ymm0, %ymm0
    vaddpd  (%r9,%rax), %ymm1, %ymm1
    vmulpd  %ymm0, %ymm1, %ymm0
    vmovupd %xmm0, (%rcx,%rax)
    vextractf128    $0x1, %ymm0, 16(%rcx,%rax)
    addq    $32, %rax
    cmpl    %r8d, %edx
    jb      .L10

i.e. assembly code that uses the AVX extensions to perform four double operations in parallel (for example, vaddpd and vmulpd).

Conversely, the following code compiled with the same parameters:

  for (i = 0; i < N; ++i)
    r[N-1-i] = (a[N-1-i] + b[N-1-i]) * c[N-1-i];

produces:

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5

which only performs one double operation at the time (vaddsd, vmulsd).

This fact alone may be responsible for a factor of 4 between the performance when iterating backward vs forward.

Using -ftree-vectorizer-verbose=2, it looks like the problem is storing backwards: "negative step for store". In fact, if a, b, and c are read backwards, but r is written into in the forward direction, and the code is vectorized again.

1
votes

It probably doesn't make a hoot of difference speed-wise, but I often write:

for (i = n; --i >= 0; ) blah blah

which I think at one time generated cleaner assembly.

Of course, in answering this kind of question, I run the risk of affirming that this is important. It's a micro-optimization kind of question, which is closely related to premature optimization, which everybody says you shouldn't do, but nevertheless SO is awash in it.

0
votes

No, we can't say that CPU implementations have changed to make forward looping faster. And that has very little to do with the CPUs themselves.

It has to do with the fact that you haven't specified which CPU you're talking about, nor which compiler.

You cannot ask a blanket question about CPU issues with the C tag and expect to get an intelligent answer simply because nothing in the C standard mandates how fast CPU s should be at various operations.

If you'd like to rephrase your question to target a specific CPU and machine language (since what machine language you get out of a C compiler depends entirely on the compiler), you may get a better answer.

In either case, it shouldn't matter. You should be relying on the fact that the people that wrote your compiler know a great deal more than you about how to eke the last inch of performance from the various CPUs.

The direction in which you should be iterating has always been dictated by what you have to do. For example, if you have to process array elements in ascending order, you use:

for (i = 0; i < 1000; i++) { process (a[i]); }

rather than:

for (i = 999; i >= 0; i--) { process (a[999-i]); }

simply because any advantage you may gain in going backwards is more than swamped by the extra calculations on i. It may well be that a naked loop (no work done in the body) may be faster in one direction than another but, if you have such a naked loop, it's not doing any real work anyway.

As an aside, it may well be that both those loops above will come down to the same machine code anyway. I've seen some of the code put out by the GCC optimizer and it made my head spin. Compiler writers are, in my opinion, a species alone when it comes to insane levels of optimization.

My advice: always program for readability first then target any specific performance problems you have ("get it working first, then get it working fast").

0
votes

When optimizing loops I'd rather look into loop unrolling (as it cuts down the number of comparisons vs. the exit value, and it may be optimized for parallel processing (MMX) depending on what goes on inside the loop).