14
votes

I wrote the following short C++ program to reproduce the false sharing effect as described by Herb Sutter:

Say, we want to perform a total amount of WORKLOAD integer operations and we want them to be equally distributed to a number (PARALLEL) of threads. For the purpose of this test, each thread will increment its own dedicated variable from an array of integers, so the process may be ideally parallelizable.

void thread_func(int* ptr)
{
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i)
    {
        (*ptr)++;
    }
}

int main()
{
    int arr[PARALLEL * PADDING];
    thread threads[PARALLEL];

    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * PADDING]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}

I think the idea is easy to grasp. If you set

#define PADDING 16

every thread will work on a separate cache line (assuming the length of a cache line to be 64 bytes). So the result will be linear increase of speedup until PARALLEL > # cores. If, on the other hand, PADDING is set to any value below 16, one should encounter severe contention, for at least two threads are now likely to operate on the same cache line, which however is protected by a built-in hardware mutex. We would expect our speedup not only to be sublinear in this case, but even to be always < 1, because of the invisible lock convoy.

Now, my first attempts nearly satisfied these expectations, yet the minimum value of PADDING needed to avoid false sharing was around 8 and not 16. I was quite puzzled for about half an hour until I came to the obvious conclusion, that there is no guarantee for my array to be aligned exactly to the beginning of a cache line inside main memory. The actual alignment may vary depending on many conditions, including the size of the array.

In this example, there is of course no need for us to have the array aligned in a special way, because we can just leave PADDING at 16 and everything works out fine. But one could imagine cases, where it does make a difference, whether a certain structure is aligned to a cache line or not. Hence, I added some lines of code to get some information about the actual alignment of my array.

int main()
{
    int arr[PARALLEL * 16];
    thread threads[PARALLEL];
    int offset = 0;

    while (reinterpret_cast<int>(&arr[offset]) % 64) ++offset;
    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * 16 + offset]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}

Despite this solution worked out fine for me in this case, I'm not sure if it would be a good approach in general. So here is my question:

Is there any common way to have objects in memory aligned to cache lines other than what I did in the above example?

(using g++ MinGW Win32 x86 v.4.8.1 posix dwarf rev3)

3
VirtualAlloc? It coughs up pages, so must be aligned. - Martin James
I'm quite surprised that you're seeing any difference at all. The compiler should be keeping *ptr inside a register = thereby hiding the false-sharing penalty. - Mysticial
For the sake of learning, I turned of compiler optimization, so ptr must be dereferenced each time. - Rene R.
@Mysticial: I assumed he was running an optimized build, but that need not be the case. Also I am a bit surprised with the statement [required padding] was around 8 and not 16... that should be unrelated to the alignment inside the cache line. If the cache line is 64bytes and padding is 8, then there will be multiple ints in the same cache line, regardless of whether they are aligned with the cache line or not. Also, for 64bytes cache lines, and 4byte ints you would want padding to be 15, not 16 right? Each cache line being 1 value and 15 placeholders... - David Rodríguez - dribeas
@DavidRodríguez-dribeas That's correct. The alignment shouldn't matter. But in this case, it might just be enough to trick the compiler to do something less stupid. As is always the case - anything goes when optimizations are off. - Mysticial

3 Answers

14
votes

You should be able to request the required alignment from the compiler:

alignas(64) int arr[PARALELL * PADDING]; // align the array to a 64 byte line
4
votes

gcc supports an aligned keyword: http://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

You probably want something like this:

int arr[PARALLEL * 16] __attribute__ ((aligned (8)));

This aligns arr to an eight-byte boundary.

Visual Studio has a similar feature, too: http://msdn.microsoft.com/en-us/library/83ythb65.aspx

3
votes