23
votes

For an assignment at school, I'm performing an intensive operation on a very large array of numbers. While benchmarking a single-threaded version operating on the whole array and comparing my results to my classmate's, I noticed some strange behavior.

The function is as follows:

int compute (char a[], int start, int end) {
    int sum = 0;
    int min = a[start];
    int max = a[start];

    for (int i = start; i < end; i++) {
        if (a[i] > max) max = a[i];
        if (a[i] < min) min = a[i];

        int cube = a[i] * a[i] * a[i];
        sum += cube;
    }

    return sum;
}

But my classmate's program is consistently running faster, often much faster. His code is identical, except for the order of the instructions in the loop body:

for (int i = start; i < end; i++) {
    int cube = a[i] * a[i] * a[i];
    sum += cube;

    if (a[i] > max) max = a[i];
    if (a[i] < min) min = a[i];
}

Here's the output comparing the runtime of each version with an input array of size 1,000,000,000 (initialized with random signed bytes):

Min/max first:
sum = 5445493143089, min = -128, max = 127
Completed in 1.050268 sec

Product-sum first:
sum = 5445493143089, min = -128, max = 127
Completed in 1.010639 sec

We have inspected the generated assembly for both versions and noticed the same instructions present, simply ordered differently. To my knowledge, this shouldn't have as significant an effect as it does, but I could be wrong. (We also noticed that the registers used differed greatly, but this I especially doubt should have an effect.)

We encounter this behavior when compiling for both C (-std=c11) and C++ (-std=c++11).

Why is the order of those lines heavily affecting the behavior of the sequential program? We are also benchmarking a parallel version of the operation, and in contrast, its behavior is almost unchanged. I looked into memory reordering as a possible culprit, but that doesn't appear to be the problem since the parallel version is virtually unaffected (and there's no overlap in the partitions anyway).

Intensive back-to-back tests demonstrating the behavior. Product-sum is always faster than min/max, even in alternation and allowing for caching.

2
Please show the generated assembly for both cases.Sander De Dycker
Just making sure - this is of course compiled with -O3 or similar, right?Griwes
Can you give a MCVE?edmz
@Purag The missing part isn't the minimal, but rather the complete part. Take an online compiler, and write up a test harness around the above. The test harness should be another 20-40 lines of code. Demonstrate that the error happens in the online version. Once done, someone can take your code and pay with the same apples, instead of having to recreate it from a description, end up with a watermelon, and get different results than you do. You are the best person to create a minimal, complete example, and it is your problem, so you should be the one who does the work!Yakk - Adam Nevraumont
I have the opposite results both at IdeOne and at my computer with -O3 flag - ideone.com/1PqJIfborisbn

2 Answers

8
votes

If we put explicit jumps into the code, you can see that the one with the conditionals at the end can avoid one jump most of the time. This is similar to the code that will actually be generated by the compiler.

First form, min/max first:

    int i = lo;
    goto start;
loop:
    i++;
start:
    if (!(i < hi)) goto end;
    if (!(a[i] > ret.max)) goto label1;
    ret.max = a[i];
label1:
    if (!(a[i] < ret.min)) goto label2;
    ret.min = a[i];
label2:

    long long square = a[i] * a[i];
    ret.sum += square;
    goto loop;
end:

Second form, min/max last:

    int i = lo;
    goto start;
loop:
    i++;
start:
    if (!(i < hi)) goto end;
    long long square = a[i] * a[i];
    ret.sum += square;

    if (!(a[i] > ret.max)) goto label1;
    ret.max = a[i];
label1:
    if (!(a[i] < ret.min)) goto loop;
    ret.min = a[i];
    goto loop;
end:
4
votes

It could be as simple as the processor jump prediction works better with the conditional jumps at the bottom of the loop...