I have a cluster, and I'm stuck with it and mpi4py. My real code is quite complex, and MPI just fails while transferring data. To make things clearer, I've written a simple "hello world" program that just transfers large arrays between nodes. The arrays are initialized with zeros and then filled with ones coming from another node.
import dill
import numpy as np
from mpi4py import MPI

MPI.pickle.__init__(dill.dumps, dill.loads)
comm = MPI.COMM_WORLD
rank = comm.rank

for k in range(5):
    if rank == 0:
        # node 0 sends hi to the other nodes
        for i in range(1, comm.size):
            msg = np.ones(10000000, np.double)
            comm.Send([msg, MPI.DOUBLE], dest=i, tag=0)
    else:
        # the other nodes receive hi
        msgin = np.zeros(10000000, np.double)
        comm.Recv([msgin, MPI.DOUBLE], source=0, tag=0)
        with open('solution1.txt', 'a') as f:
            f.write(f'{rank} hi, {msgin[:10]} {np.average(msgin)}\n')
        # and then send a reply to node 0
        msgout = np.ones(10000000)
        comm.Send([msgout, MPI.DOUBLE], dest=0, tag=1)
    if rank == 0:
        # node 0 receives the replies
        for i in range(1, comm.size):
            msgin = np.zeros(10000000, np.double)
            comm.Recv([msgin, MPI.DOUBLE], tag=1, source=i)
            with open('solution1.txt', 'a') as f:
                f.write(f'{rank} reply, {msgin[:10]} {np.average(msgin)}\n')
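For completeness, I launch it with the usual MPI launcher, roughly as below (the script name solution1.py is made up for this example, and the exact hostfile and flags are cluster-specific; 6 ranks matches the log, where ranks 1-5 reply):

mpirun -np 6 python solution1.py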
Here are the results:
1 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
2 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
3 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
4 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
5 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
1 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
2 reply [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
3 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
4 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
5 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
1 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
2 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
3 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
4 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
5 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
1 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
2 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
3 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
4 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
5 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
1 hi [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
As you can see, occasionally only 6 double values are transferred instead of 10000000. This log is not complete: all subsequent messages carry only 6 values too. Interestingly, the results are reproducible: node 2 always replies with a correct message the first time, and all the other nodes reply with incorrect ones.
The code runs perfectly on a single node of the same cluster. It also runs perfectly on Google Cloud (6 nodes with 32 cores each). I have tried different tricks, all with the same result:
- replaced Send/Recv with Isend/Irecv + Wait (see the first sketch after this list);
- used the lowercase send/recv, with both the standard pickle and the dill pickle (see the second sketch); this version fails while decoding the pickled data;
- tried OpenMPI 2.1.1, 4.0.1 and the Intel MPI libraries;
- tried a fix from Intel:
  export I_MPI_SHM_LMT=shm
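Here is roughly what the non-blocking variant looked like; a minimal sketch assuming the same sizes and tags as above, not the exact code I ran. Isend/Irecv return Request objects, and the send buffers have to stay alive until Waitall completes:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    msgs, reqs = [], []
    for i in range(1, comm.size):
        msg = np.ones(10000000, np.double)
        msgs.append(msg)  # keep each buffer alive until its send completes
        reqs.append(comm.Isend([msg, MPI.DOUBLE], dest=i, tag=0))
    MPI.Request.Waitall(reqs)
else:
    msgin = np.zeros(10000000, np.double)
    req = comm.Irecv([msgin, MPI.DOUBLE], source=0, tag=0)
    req.Wait()

It failed in exactly the same way as the blocking version.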
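The pickle-based variant was along these lines (again only a sketch): the lowercase send/recv serialize the whole object instead of shipping the raw buffer, and this is the version that dies while decoding the pickled data on the receiving side:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    for i in range(1, comm.size):
        comm.send(np.ones(10000000, np.double), dest=i, tag=0)
else:
    msgin = comm.recv(source=0, tag=0)  # fails here, while unpickling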
Maybe there is a problem with the network setup, but I really don't know what to try.
The setup is a multi-node cluster with a Mellanox 4x FDR InfiniBand interconnect in a 2:1 oversubscribed fat tree. Sets of 24 nodes have 12 uplinks into the large core InfiniBand switch. Each node has 64 GiB of 4-channel 2133 MHz DDR4 SDRAM (68 GB/s) and two Intel Xeon E5-2670 v3 (Haswell) CPUs.
Try replacing
    msgout = np.ones(10000000)
with
    msgout = np.ones(10000000, np.double)
– Gilles Gouaillardet