I am trying to control where my MPI code executes. There are several ways to do so: taskset, dplace, numactl, or just the mpirun options like --bind-to or -cpu-set.
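(To illustrate what I mean by "control where I execute": for a single sequential process I would simply do something like the line below, with an arbitrary core as an example; the problem is doing the equivalent for 64 MPI ranks at once.)
taskset -c 144 ./myexec param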
The machine is shared memory: 16 NUMA nodes, each with 2 x 12 cores (so 24 cores per node).
> numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 192 193 194 195 196 197 198 199 200 201 202 203
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 204 205 206 207 208 209 210 211 212 213 214 215
node 2 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 216 217 218 219 220 221 222 223 224 225 226 227
... (output truncated)
node 15 cpus: 180 181 182 183 184 185 186 187 188 189 190 191 372 373 374 375 376 377 378 379 380 381 382 383
node distances:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0: 10 50 65 65 65 65 65 65 65 65 79 79 65 65 79 79
1: 50 10 65 65 65 65 65 65 65 65 79 79 65 65 79 79
2: 65 65 10 50 65 65 65 65 79 79 65 65 79 79 65 65
3: 65 65 50 10 65 65 65 65 79 79 65 65 79 79 65 65
4: 65 65 65 65 10 50 65 65 65 65 79 79 65 65 79 79
5: 65 65 65 65 50 10 65 65 65 65 79 79 65 65 79 79
6: 65 65 65 65 65 65 10 50 79 79 65 65 79 79 65 65
7: 65 65 65 65 65 65 50 10 79 79 65 65 79 79 65 65
8: 65 65 79 79 65 65 79 79 10 50 65 65 65 65 65 65
9: 65 65 79 79 65 65 79 79 50 10 65 65 65 65 65 65
10: 79 79 65 65 79 79 65 65 65 65 10 50 65 65 65 65
11: 79 79 65 65 79 79 65 65 65 65 50 10 65 65 65 65
12: 65 65 79 79 65 65 79 79 65 65 65 65 10 50 65 65
13: 65 65 79 79 65 65 79 79 65 65 65 65 50 10 65 65
14: 79 79 65 65 79 79 65 65 65 65 65 65 65 65 10 50
15: 79 79 65 65 79 79 65 65 65 65 65 65 65 65 50 10
My code does not take advantage of the shared memory; I would like to use the machine as if it were distributed memory. But the processes seem to move around and end up too far from their data, so I would like to bind them and see whether performance improves.
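For context, this is roughly how I watch the placement (just my ad-hoc check: PSR is the processor each process last ran on, and myexec is my binary):
watch -n 1 'ps -eo pid,psr,comm | grep myexec'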
What I have tried so far:
The classic call:
mpirun -np 64 ./myexec param > logfile.log
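To see what mpirun does by default, I believe one can add --report-bindings (Open MPI), e.g.:
mpirun --report-bindings -np 64 ./myexec param > logfile.log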
Now I want to bind the run to the last nodes, say 12 to 15, with dplace or numactl (I do not see a major difference between the two...):
mpirun -np 64 dplace -c144-191,336-383 ./myexec param > logfile.log
mpirun -np 64 numactl --physcpubind=144-191,336-383 -l ./myexec param > logfile.log
(The main difference between the two is the -l of numactl, which binds memory allocation to the local node, but I am not even sure it makes a difference...)
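If I read the numactl man page correctly, binding by node instead of by physical cpu should amount to the same thing in my case; this is an untested variant I am considering:
mpirun -np 64 numactl --cpunodebind=12-15 --membind=12-15 ./myexec param > logfile.log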
Both dplace and numactl work well in the sense that the processes are bound where I wanted, BUT looking more closely at each process, it turns out that some are allocated on the same core! So each of them only uses 50% of that core! This happens even though the number of available cores is larger than the number of processes. This is not good at all.
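To be concrete, this is how I spot it: sorting the ranks by the PSR column shows two PIDs on the same processor, each at around 50% CPU:
ps -eo pid,psr,pcpu,comm | grep myexec | sort -n -k2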
So I tried adding mpirun options like --nooversubscribe, but it changes nothing... I do not understand why. I also tried --bind-to none (to avoid a conflict between mpirun and dplace/numactl), -cpus-per-proc 1 and -cpus-per-rank 1... none of it solves the problem.
So I tried with mpirun options only:
mpirun -cpu-set 144-191 -np 64 ./myexec param > logfile.log
but the -cpu-set option is not well documented, and I do not find a way to bind one process per core with it.
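For what it is worth, this is the kind of mpirun-only command I imagine should give one process per core on the chosen cores, but I cannot confirm that these options combine this way (Open MPI syntax assumed):
mpirun -np 64 -cpu-set 144-191,336-383 --bind-to core --map-by core ./myexec param > logfile.log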
The question: can someone help me get one process per core, on the cores that I want?