How to use all NUMA nodes with openMP on Windows 10

Question

I have access to a dual-socket system consisting of two NUMA nodes to do some data processing.

My code is relatively straightforward and I'm using openMP for a main parallelizable loop that looks like this (k is a function parameter and buffer is a multi-gigabytes array of length n):

uint64_t m=0;
uint64_t *rk = (uint64_t *) calloc(k, sizeof(uint64_t));
#pragma omp parallel
{
    #pragma omp for reduction(+:m), reduction(+:rk[:k])
    for (uint64_t i=0; i<n-k; i++)
    {
        m += (uint64_t)buffer[i];
        for (uint64_t j=0; j<k; j++)
        {
            rk[j] += (uint64_t)buffer[i]*(uint64_t)buffer[i+j];
        }
    }
    /* Other stuff, serial and parallel */
}

Under Linux Mint I can compile with gcc without problem and all of the cores on both sockets are put to good use. However, on Windows (mingw-gcc on cygwin) only a single NUMA node is used. Since my code isn't really sensitive to the memory latency, I get 2x slowdown on Windows.

I can't figure out how to force Windows to spread the threads on both nodes. As far as I understand, openMP doesn't support affinity on Windows (cygwin mingw-gcc implementation anyways), but I don't know how I should do it manually.

Any help is greatly appreciated!

I've use gcc and mingw's gcc (x86_64-w64-mingw32-gcc), both under Cygwin. The results are the same. — JeanOlivier
I just tried with the Visual Studio 2017 compiler in command line (cl) and I have the same issue. I'll try the Intel compiler next, but I'm pessimistic. — JeanOlivier
Intel compiler uses all logical cores right out of the box! Although in my case I ended up sticking with gcc and assigning the threads to CPU groups manually. I'll look into hwloc to make it more portable. — JeanOlivier

JeanOlivier JeanOlivier · Accepted Answer · 2018-10-01T17:07:38

I found the cause of the issue. There is over 64 logical core on the machine, and as such Windows requires two CPU groups to address them. By default, it places each NUMA nodes in its own group.

The fix is either disabling HTT if you have less than 64 physical cores, or disabling the NUMA grouping in the bios. In the latter case, the first 64 logical cores will be grouped and appear as a single NUMA node in Windows and the remainder is placed in the second node. The ideal solution will depend on your specific application, whether you benefit from using all the cores, or from hyperthreadng..

[EDIT] You can also manage threads manually. If you want to do that, I suggest digging into Processtopologyapi.h and processthreadsapi.h, in particular into functions GetActiveProcessorCount and SetThreadGroupAffinity.

How to use all NUMA nodes with openMP on Windows 10

1 Answers