16 votes

I am developing large dense matrix multiplication code. When I profile the code it sometimes gets about 75% of the peak flops of my four-core system and other times gets about 36%. The efficiency does not change during a run: it either starts at 75% and continues at that efficiency, or starts at 36% and continues at that efficiency.

I have traced the problem down to hyper-threading and the fact that I set the number of threads to four instead of the default eight. When I disable hyper-threading in the BIOS I get about 75% efficiency consistently (or at least I never see the drastic drop to 36%).

Before I call any parallel code I do omp_set_num_threads(4). I have also tried export OMP_NUM_THREADS=4 before I run my code but it seems to be equivalent.
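For context, the call site looks roughly like this (a simplified sketch only; my real kernel is blocked and vectorized, this just shows where the thread count is set):

#include <omp.h>

/* Simplified sketch of the call site; the real kernel is blocked and
   vectorized. omp_set_num_threads(4) is called before the parallel region. */
void gemm(const double *A, const double *B, double *C, int n)
{
    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}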

I don't want to disable hyper-threading in the BIOS. I think I need to bind the four threads to the four cores. I have tested some different settings of GOMP_CPU_AFFINITY, but so far I still have the problem that the efficiency is sometimes 36%. What is the mapping between hyper-threads and cores? E.g. do thread 0 and thread 1 correspond to the same core, and thread 2 and thread 3 to another core?
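(I could presumably check that mapping myself. Something like the following sketch, which reads the Linux sysfs topology files, should show which physical core each logical CPU belongs to; the fixed count of 8 matches my machine.)

#include <stdio.h>

/* Sketch: print which physical core each of the 8 logical CPUs belongs to,
   using the Linux sysfs topology files. */
int main(void)
{
    for (int cpu = 0; cpu < 8; cpu++) {
        char path[128];
        int core = -1;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        FILE *f = fopen(path, "r");
        if (f) { fscanf(f, "%d", &core); fclose(f); }
        printf("logical CPU %d -> physical core %d\n", cpu, core);
    }
    return 0;
}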

How can I bind the threads to each core without thread migration so that I don't have to disable hyper-threading in the BIOS? Maybe I need to look into using sched_setaffinity?
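If sched_setaffinity is the way to go, I imagine something along these lines (only a sketch, not code I have tested; it assumes logical CPUs 0-3 are four distinct physical cores, which is exactly what I am unsure about):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

/* Sketch: pin each of the four OpenMP threads to the logical CPU with the
   same index. sched_setaffinity(0, ...) applies to the calling thread. */
void pin_threads(void)
{
    #pragma omp parallel num_threads(4)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);
        sched_setaffinity(0, sizeof(set), &set);
    }
}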

Some details of my current system: Linux kernel 3.13, GCC 4.8, Intel Xeon E5-1620 (four physical cores, eight hyper-threads).

Edit: This seems to be working well so far

export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"

or

export GOMP_CPU_AFFINITY="0-7"

Edit: This also seems to work well

export OMP_PROC_BIND=true

Edit: These options also work well (gemm is the name of my executable)

numactl -C 0,1,2,3 ./gemm

and

taskset -c 0,1,2,3 ./gemm
Since export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7" gives good results, I guess that means threads 0 and 4 map to core 0, threads 1 and 5 to core 1, and so on, i.e. the threads are assigned like electrons in orbitals: it first puts one thread on each core (threads 0-3) and, once every core has a thread, it goes back and assigns the remaining threads (4-7) to the same cores. – Z boson
Both hwloc-ls from the hwloc library and cpuinfo from Intel MPI provide essential topology information about the machine, e.g. the mapping of logical CPU numbers to physical cores/threads. The numbering depends on the BIOS, but in my experience in most cases the hyper-threads are cycled in an "outer loop". Also, you could use the shorthand notation "0-7". – Hristo Iliev
@HristoIliev, for portability it seems the right way to do this is to use OMP_PLACES from OpenMP 4.0, e.g. export OMP_PLACES=cores. On AMD systems each module only has one FPU but gets two threads, and I think they are assigned linearly (stackoverflow.com/questions/19780554/…), so GOMP_CPU_AFFINITY="0-7" won't work there. Actually, OMP_PROC_BIND=true might be fine then as well. Maybe that's the best solution. – Z boson
My comment was only that "0-7" is the same as "0 1 2 3 4 5 6 7". With libgomp, OMP_PROC_BIND=true is practically the same as GOMP_CPU_AFFINITY="0-(#cpus-1)", i.e. there is no topology awareness, at least for versions before 4.9. – Hristo Iliev
On AMD CPUs I had to use GOMP_CPU_AFFINITY="0-24:2" to get decent performance. Cores without an FPU are just fake cores to me in this century. – Vladimir F

1 Answer

3 votes

This isn't a direct answer to your question, but it might be worth looking into: apparently, hyper-threading can cause your cache to thrash. Have you tried running the code under valgrind (e.g. its cachegrind tool) to see what kind of misses are causing your problem? There might be a quick fix to be had from allocating some junk at the top of every thread's stack so that your threads don't end up kicking each other's cache lines out.

It looks like your CPU is 4-way set associative so it's not insane to think that, across 8 threads, you might end up with some really unfortunately aligned accesses. If your matrices are aligned on a multiple of the size of your cache, and if you had pairs of threads accessing areas a cache-multiple apart, any incidental read by a third thread would be enough to start causing conflict misses.

For a quick test -- if you change your input matrices to something that's not a multiple of your cache size (so they're no longer aligned on a boundary) and your problems disappear, then there's a good chance that you're dealing with conflict misses.
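To make that concrete, here is a rough sketch of the kind of change I mean (the pad of one cache line is only an example; the right amount depends on your cache geometry):

#include <stdlib.h>

/* Sketch: allocate an n x n matrix with a padded leading dimension so that
   consecutive rows are no longer separated by an exact multiple of the
   cache-set stride. The pad of 8 doubles (64 bytes, one cache line) is
   just an example value. */
double *alloc_padded(int n, int *ld)
{
    *ld = n + 8;
    return malloc((size_t)n * (size_t)(*ld) * sizeof(double));
}

/* Then access element (i, j) as A[i * ld + j] instead of A[i * n + j]. */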