I have access to a dual-socket system (two NUMA nodes) for some data processing.
My code is relatively straightforward: I'm using OpenMP for the main parallelizable loop, which looks like this (k is a function parameter and buffer is a multi-gigabyte array of length n):
uint64_t m = 0;
uint64_t *rk = (uint64_t *) calloc(k, sizeof(uint64_t));
#pragma omp parallel
{
    /* m accumulates the sum of buffer; rk[j] accumulates the
       unnormalized lag-j term, sum over i of buffer[i]*buffer[i+j] */
    #pragma omp for reduction(+:m) reduction(+:rk[:k])
    for (uint64_t i = 0; i < n - k; i++)
    {
        m += (uint64_t)buffer[i];
        for (uint64_t j = 0; j < k; j++)
        {
            rk[j] += (uint64_t)buffer[i] * (uint64_t)buffer[i + j];
        }
    }
    /* Other stuff, serial and parallel */
}
Under Linux Mint I can compile with gcc without any problem, and all of the cores on both sockets are put to good use. On Windows (mingw-gcc under Cygwin), however, only a single NUMA node is used. My code isn't really sensitive to memory latency, so being confined to one node simply costs me a 2x slowdown on Windows.
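For reference, this is the kind of quick check I can run to see where the threads land (I'm assuming each NUMA node corresponds to one Windows processor group on this machine):

#include <windows.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Print which processor group / logical CPU each OpenMP
           thread is running on; on Windows they all report the
           same group for me. */
        PROCESSOR_NUMBER pn;
        GetCurrentProcessorNumberEx(&pn);
        printf("thread %d -> group %u, cpu %u\n",
               omp_get_thread_num(),
               (unsigned)pn.Group, (unsigned)pn.Number);
    }
    return 0;
}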
I can't figure out how to force Windows to spread the threads across both nodes. As far as I understand, OpenMP's affinity controls don't work on Windows (at least not in the Cygwin mingw-gcc implementation), but I don't know how to set thread affinity manually either.
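The closest I've come up with on my own is calling the Win32 processor-group API from inside the parallel region, along the lines of the sketch below (pin_to_group is my own hypothetical helper, and I'm again assuming each NUMA node shows up as one processor group). I have no idea whether this is the right approach:

#include <windows.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>

/* Hypothetical helper: pin the calling thread to a processor group
   chosen round-robin from the OpenMP thread id, so the team gets
   spread across all groups (= NUMA nodes, on this machine). */
static void pin_to_group(int tid)
{
    WORD ngroups = GetActiveProcessorGroupCount();
    GROUP_AFFINITY ga;
    memset(&ga, 0, sizeof ga);
    ga.Group = (WORD)(tid % ngroups);
    /* Mask covering every logical processor in the chosen group */
    DWORD nprocs = GetActiveProcessorCount(ga.Group);
    ga.Mask = (nprocs >= sizeof(KAFFINITY) * 8)
                  ? ~(KAFFINITY)0
                  : (((KAFFINITY)1 << nprocs) - 1);
    if (!SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL))
        fprintf(stderr, "SetThreadGroupAffinity failed: %lu\n",
                (unsigned long)GetLastError());
}

void process(void) /* stand-in for my actual function */
{
    #pragma omp parallel
    {
        pin_to_group(omp_get_thread_num());
        /* ... the loop above ... */
    }
}

The round-robin choice is just a guess, and I don't know whether the threads keep this affinity across subsequent parallel regions.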
Any help is greatly appreciated!