How to implement base 2 loop unrolling at run-time for optimization purposes

Question

Consider some code that needs to executed repeatedly anywhere between 1-1,000,000 times, and that the number of repeats is not known at compile time. It is my understanding that loop unrolling would be a negligible optimization given the large number of loops and would only be optimizing up to the max_unrolls specified at compile time. The idea that I have come up with is to implement a binary (or base 2) partial loop un-roller that essentially executes some_function repeatedly a number of times that is specified at run-time. I have come up with some code to demonstrate the idea, an condensed version is show below. The method used in the code below has a number of issues from a usability perspective.

It requires the coder to manually copy out the base 2 unroll essentially copy out the unroll 2^n-1 times.
This also needs to be re-done for every new function that needs to use this method.

My question is threefold. Firstly am I missing something, is the compiler already intelligent enough to do this itself? Secondly what is an efficient way of implementing this so that it becomes feasible to benchmark this against a standard for loop, whilst solving the above issues. Thirdly to your knowledge is there a library that has already implemented this.

Please note: I am doing this purely for the fun of it, but do not have the expertise to know if this will be effective. I have done testing on the code but have only found very small improvements, however I believe I have not been able to unroll manually far enough for a fair comparison to be made. Also I am aware that this method has the potential to create massive binary sizes, however I believe this would be a worthwhile time memory trade off. Also, if you post any assembly it will probably take me another year or so to understand it.

inline void some_reapeated_function(int &function_parameter_1, int &function_parameter_2)
{
    function_parameter_1 /= function_parameter_2;
}

// Function to be called when you need it to be unrolled.
int runtime_unroll(unsigned int &no_of_loops, int &function_parameter_1, int &function_parameter_2)
{
    std::vector<bool> binary_vector;

    // Stores the number of loops in a binary representation in a vector.
    binary_function.reserve(no_of_loops);
     while(no_of_loops) 
    {
        if (no_of_loops&1)
          binary_vector.push_back(false);
        else
          binary_vector.push_back(true);
        no_of_loops>>=1;
    } 

    // If binary of no_of_loops contains a 2^0 execute once.
    if (binary_vector[0])
    {
        some_reapeated_function(function_parameter_1,function_parameter_2);
    }
    // If binary of no_of_loops contains a 2^1 execute twice.
    if (binary_vector[1])
    {
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
    }
    //If binary of no_of_loops contains a 2^2 execute 4 times.
    if (binary_vector[2])
    {
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
        some_reapeated_function(function_parameter_1,function_parameter_2);
    }


    /* This example only covers from 1 to 2^3-1 or up to 7 unrolls. 
    This can continue depending on the number of repetitions needed and 
    could incorporate a for loop to continue after the loop has fully unrolled */
}

The more standard practice is a partially unrolled loop that performs some number of operations (e.g. 8) each iteration, and a second loop that does the remainder. Past a certain loop size, unrolling further won't have a noticeable benefit, and will probably make things worse due to using more of the instruction cache. You can also try Duff's device to get rid of the second loop. — interjay
@interjay The idea behind the function I am trying to implement is to make the number of unrolls completely flexible and thus optimizing the loop no matter the number of iterations needed. Imagine an int in binary form where each of the 1's represent a function being executed to 2^n times where n is the index of the each bit within the integer? Duffs device is interesting though, thankyou :) — silvergasp

stgatilov stgatilov · Accepted Answer · 2015-11-27T14:16:37

You can easily implement something like this with C++ templates. Note that you are still at mercy of your compiler: there is no guarantee all the function calls would get inlined. If they are not, you can try using __forceinline keyword (or its equivalent).

First of all, you need an unroller which takes a function as argument and executes it K times in a fully unrolled loop. The function call must be inlined, so you have to use functor objects instead of function pointers or std::function-s, and functor's type must be template. The unroller itself can be implemented as a recursive loop by integer template argument. Since functions in C++ cannot have partial template specialization, we have to make our unroller a template class. Here is sample code:

// execute UnrollCnt times in unrolled fashion
template<int UnrollCnt, class Functor> struct Unroller {
    static inline void Run(int base, const Functor &func) {
        func(base);
        Unroller<UnrollCnt - 1, Functor>::Run(base + 1, func);
    }
};
template<class Functor> struct Unroller<0, Functor> {
    static inline void Run(int base, const Functor &func) {
    }
};

Given the unroller, we can easily implement the unrolled loop. If we have N iterations, then we can call our unroller [N/K] times, then perform a few remaining calls as usual. Note that functor's type must still be template here. Here is the code:

// execute with argument in range [begin, end)
template<int UnrollCnt, class Functor>
void UnrolledFor(int begin, int end, const Functor &func) {
    // iterations with unrolling
    int iter = begin;
    for (; iter <= end - UnrollCnt; iter += UnrollCnt)
        Unroller<UnrollCnt, Functor>::Run(iter, func);
    // last iterations without unrolling
    for (; iter < end; iter++)
        func(iter);
}

Now we can call the UnrolledFor loop for any function accepting a single argument, being the loop's iteration count. For example, we can compute sum of numbers from 0 to N-1:

long long ans = 0;
int main() {
    int cnt = 0;
    scanf("%d", &cnt);
    int start = clock();
    // note: passing a lambda function here, requesting 8x unrolling
    UnrolledFor<8>(0, cnt, [](int i) {
        ans += i;
    });
    int elapsed = clock() - start;
    printf("%lld   (%d pg)\n", ans, elapsed);
    return 0;
}

Note however, that manual unrolling may work much faster, because the thick level of abstraction here is not trivial to cut through for the compiler. For example, here are some timings I observe for the sample code (with N = 2000000000):

With MSVC 2013 x64:
1999999999000000000   (421 pg)   // 8x unrolling, ans is global
1999999999000000000   (1389 pg)  // 1x unrolling, ans is global
1999999999000000000   (4664 pg)  // 8x unrolling, ans is local
1999999999000000000   (1388 pg)  // 1x unrolling, ans is local
With MinGW GCC 5.1.0 x64:
1999999999000000000   (1388 pg)  // 1x unrolling, ans is global
1999999999000000000   (1404 pg)  // 8x unrolling, ans is global
1999999999000000000   (1389 pg)  // 1x unrolling, ans is local
1999999999000000000   (1393 pg)  // 8x unrolling, ans is local

As you see, only MSVC with global ans variable did really win from unrolling. But with local ans variable (captured by reference) it got several times slower instead.

So if you are really obsessed with performance, I suggest using macros for loop unrolling, they add absolutely no overhead.

How to implement base 2 loop unrolling at run-time for optimization purposes

1 Answers