For an assignment at school, I'm performing an intensive operation on a very large array of numbers. While benchmarking a single-threaded version operating on the whole array and comparing my results to my classmate's, I noticed some strange behavior.
The function is as follows:
int compute (char a[], int start, int end) {
int sum = 0;
int min = a[start];
int max = a[start];
for (int i = start; i < end; i++) {
if (a[i] > max) max = a[i];
if (a[i] < min) min = a[i];
int cube = a[i] * a[i] * a[i];
sum += cube;
}
return sum;
}
But my classmate's program is consistently running faster, often much faster. His code is identical, except for the order of the instructions in the loop body:
for (int i = start; i < end; i++) {
int cube = a[i] * a[i] * a[i];
sum += cube;
if (a[i] > max) max = a[i];
if (a[i] < min) min = a[i];
}
Here's the output comparing the runtime of each version with an input array of size 1,000,000,000 (initialized with random signed bytes):
Min/max first:
sum = 5445493143089, min = -128, max = 127
Completed in 1.050268 sec
Product-sum first:
sum = 5445493143089, min = -128, max = 127
Completed in 1.010639 sec
We have inspected the generated assembly for both versions and noticed the same instructions present, simply ordered differently. To my knowledge, this shouldn't have as significant an effect as it does, but I could be wrong. (We also noticed that the registers used differed greatly, but this I especially doubt should have an effect.)
We encounter this behavior when compiling for both C (-std=c11
) and C++ (-std=c++11
).
Why is the order of those lines heavily affecting the behavior of the sequential program? We are also benchmarking a parallel version of the operation, and in contrast, its behavior is almost unchanged. I looked into memory reordering as a possible culprit, but that doesn't appear to be the problem since the parallel version is virtually unaffected (and there's no overlap in the partitions anyway).
Intensive back-to-back tests demonstrating the behavior. Product-sum is always faster than min/max, even in alternation and allowing for caching.
IdeOne
and at my computer with-O3
flag - ideone.com/1PqJIf – borisbn