1
votes

I have a cluster, and I'm stuck with it and mpi4py. My actual code is quite complex, and MPI simply fails when transferring data. To make things clearer, I've written a simple "hello world" program that just transfers large arrays between nodes. The arrays are initialized with zeros and then filled with ones coming from another node.

import dill
from mpi4py import MPI
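# route the pickle-based (lowercase) send/recv methods through dill instead of the standard pickle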
MPI.pickle.__init__(dill.dumps, dill.loads)

comm = MPI.COMM_WORLD
rank = comm.rank

import numpy as np

for k in range(5):
    if rank == 0:
        # node 0 sends hi to other nodes
        for i in range(1, comm.size):
            msg = np.ones(10000000, np.double)
            comm.Send([msg, MPI.DOUBLE], dest=i, tag=0)
    else:
        # other nodes receive hi
        msgin = np.zeros(10000000, np.double)
        comm.Recv([msgin, MPI.DOUBLE], source=0, tag=0)
        with open('solution1.txt', 'a') as f:
            f.write(f'{rank} hi, {msgin[:10]} {np.average(msgin)}\n')
        # and then send reply to 0 node
        msgout = np.ones(10000000)
        comm.Send([msgout, MPI.DOUBLE], dest=0, tag=1)

    if rank == 0:
        # node 0 receives replies
        for i in range(1, comm.size):
            msgin = np.zeros(10000000, np.double)
            comm.Recv([msgin, MPI.DOUBLE], tag=1, source=i)
            with open('solution1.txt', 'a') as f:
                f.write(f'{rank} reply, {msgin[:10]} {np.average(msgin)}\n')


Here are the results:

1 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
2 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
3 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
4 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
5 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
1 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
2 reply [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
3 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
4 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
5 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
1 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
2 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
3 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
4 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
5 hi [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 1.0
1 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
2 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
3 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
4 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
5 reply [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08
1 hi [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.] 6e-08

As you can see, occasionally only 6 double values are transferred instead of 10000000. The log is not complete: all subsequent messages contain only 6 values as well. Interestingly, the results are reproducible: node 2 always replies with a correct message at first, while all the other nodes reply with incorrect ones.

The code runs perfectly on a single node of the same cluster. It also runs perfectly on Google Cloud (6 nodes, 32 cores each). I have tried various tricks and got the same result:

  • replaced Send/Recv with Isend/Irecv + Wait (see the sketch after this list)

  • used send/recv with the standard pickle and with dill; in that case the code fails while decoding the pickled data

  • tried OpenMPI 2.1.1, OpenMPI 4.0.1 and the Intel MPI library

  • tried a fix from Intel:

export I_MPI_SHM_LMT=shm
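
Here is a rough sketch of the Isend/Irecv + Wait variant mentioned in the first bullet (one exchange round, same buffer size and tags as above; just an illustration, not the exact code I ran):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = 10000000

if comm.rank == 0:
    # keep the send buffers alive until Waitall completes
    bufs = [np.ones(N, np.double) for _ in range(1, comm.size)]
    reqs = [comm.Isend([buf, MPI.DOUBLE], dest=i, tag=0)
            for i, buf in enumerate(bufs, start=1)]
    MPI.Request.Waitall(reqs)
else:
    msgin = np.zeros(N, np.double)
    req = comm.Irecv([msgin, MPI.DOUBLE], source=0, tag=0)
    req.Wait()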

Maybe there is a problem with the network setup, but I really don't know what to try.

The setup is a multi-node cluster with a Mellanox 4x FDR InfiniBand interconnect in a 2:1 oversubscribed fat tree. Sets of 24 nodes have 12 uplinks into the large core InfiniBand switch. Each node has 64 GiB of 4-channel 2133 MHz DDR4 SDRAM (68 GB/s) and two Intel Xeon E5-2670 v3 (Haswell) CPUs.

Gilles Gouaillardet: when replying, try replacing msgout = np.ones(10000000) with msgout = np.ones(10000000, np.double)
Andrew Telyatnik: Thanks, I've corrected it, but numpy uses np.double as the default, so the results are the same.

2 Answers

0
votes

In my setup I found it useful to always use the same buffer to receive the results.

So maybe try declaring msgin = np.zeros(10000000, np.double) just once, early in the code, and reuse it again and again. Each time a message is received, you can do msgin.fill(np.nan) to "clear" the buffer.
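
A minimal sketch of that idea, based on the receive loop from the question (illustration only, not tested on your exact setup):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
msgin = np.zeros(10000000, np.double)   # allocated once, early in the code

for k in range(5):
    if comm.rank != 0:
        msgin.fill(np.nan)              # "clear" the reused buffer before each receive
        comm.Recv([msgin, MPI.DOUBLE], source=0, tag=0)
        # ... use msgin and send the reply as before ...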

I have no clue why it worked for me, but the problem seems to be gone.

Good luck!

0
votes

I had an analogous problem in a similar cluster environment. It only appeared in connection with InfiniBand multi-node communication, for send/receive buffers above a certain size (> 30 MB). I could solve the problem by downgrading from OpenMPI 3.x to OpenMPI 2.1.5 (the mpi4py version seems irrelevant).

A corresponding C code ran fine (same cluster, same number of cores/nodes, same OpenMPI 3.x, etc.), so the problem seems to lie between mpi4py and OpenMPI 3.x, but it was hopeless for me to debug.
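
If you want to double-check which MPI library mpi4py is actually linked against before and after such a downgrade, a quick check like this should work (it only uses standard mpi4py calls):

import mpi4py
from mpi4py import MPI

if MPI.COMM_WORLD.rank == 0:
    # prints the MPI library banner, e.g. "Open MPI v2.1.5, ..." for Open MPI builds
    print(MPI.Get_library_version())
    print("mpi4py", mpi4py.__version__)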