
I have a program that runs correctly when compiled with the Open MPI library, but fails with an error in MPI_Allreduce() when compiled with MPICH 3.2.1. This occurs on both Linux and macOS.

The relevant code is

typedef struct reduction_packet {
  double sig;        /* logev */
  double s_width;    /* starting width */
  double s_nsites;   /* starting nsites */
  double width;      /* width of motif */
  double nsites_dis; /* final number of sites */
  double llr;        /* LLR of motif */
  double classic;    /* true if Classic objective function */
  double ID;         /* use a double so the MPI type handle is simple */
} REDUCE_PACKET;

REDUCE_PACKET a_packet, best_packet;
...
MPI_Allreduce((void *)&a_packet, (void *)&best_packet, 1,
              reduction_packet_type, max_packets_op, MPI_COMM_WORLD);

The root error on the MPI stack is

MPIR_Localcopy(100)......: memcpy arguments alias each other, dst=0x7ffeeadd2f80 src=0x7ffeeadd2fc0 len=72

My interpretation is that MPICH is telling me that the buffers a_packet and best_packet overlap: the copy length is 72 bytes, but the two variables are only 64 bytes apart.

Each of these buffers is a struct composed of 8 doubles, which accounts for 64 bytes. I could imagine some padding being added for alignment, but the compiler seems happy to allocate the two variables on the stack without padding. I've logged the addresses of a_packet and best_packet, and they match the addresses reported in the error message from MPIR_Localcopy().
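A check along these lines (a sketch, assuming the struct and variables from the snippet above) shows the numbers involved:

#include <stdio.h>

/* Log the struct size and both stack addresses so they can be compared
   against the dst/src/len values in the MPIR_Localcopy() message.
   sizeof(REDUCE_PACKET) is 64: 8 doubles with no padding. */
printf("sizeof(REDUCE_PACKET) = %zu\n", sizeof(REDUCE_PACKET));
printf("a_packet    at %p\n", (void *)&a_packet);
printf("best_packet at %p\n", (void *)&best_packet);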

If I change the declaration for the two variables to

REDUCE_PACKET a_packet;
char foo[2];
REDUCE_PACKET best_packet;

the program runs without error on both MPICH and Open MPI.

Why does MPICH think this variable requires 72 bytes rather than 64 bytes? Am I missing something in the MPI/MPICH documentation that would inform me that I'm responsible for this sort of manual padding?

A minimal reproducible example is required for further investigation. – Gilles Gouaillardet
Meanwhile, you can print MPI_Type_get_true_extent(reduction_packet_type, ...) and compare the output between Open MPI and MPICH. – Gilles Gouaillardet
@GillesGouaillardet Thanks, the suggestion to check MPI_Type_get_true_extent() was the key to debugging this. – Charles E. Grant

1 Answer


If you encounter this type of error, double-check the definition of the MPI_Datatype passed as the third argument to MPI_Allreduce():

MPI_Allreduce((void *)&a_packet, (void *)&best_packet, 1,
              reduction_packet_type, max_packets_op, MPI_COMM_WORLD);

It turned out that we had changed the definition of the underlying C datatype, REDUCE_PACKET, but hadn't updated the call to MPI_Type_contiguous() that establishes the size of the MPI datatype, reduction_packet_type.
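For illustration, a plausible reconstruction of the bug (the count of 9 is inferred from the 72-byte length in the error message, not copied from our source):

/* Stale datatype definition: the struct shrank to 8 doubles, but the
   MPI type still described 9, so one "element" spans 72 bytes. */
MPI_Type_contiguous(9, MPI_DOUBLE, &reduction_packet_type);
MPI_Type_commit(&reduction_packet_type);

Deriving the count from the struct itself keeps the two definitions from drifting apart again:

/* Fixed: the element count now tracks the struct definition. */
MPI_Type_contiguous(sizeof(REDUCE_PACKET) / sizeof(double),
                    MPI_DOUBLE, &reduction_packet_type);
MPI_Type_commit(&reduction_packet_type);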

Apparently MPICH checks for potentially overlapping memory copies more carefully than Open MPI does.
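A cheap guard against this class of bug, following the suggestion in the comments, is to check at startup that the committed type's true extent matches the struct (a sketch, assuming the names from the question):

/* Any mismatch between the MPI datatype's true extent and the C
   struct's size means the two definitions have drifted apart. */
MPI_Aint true_lb, true_extent;
MPI_Type_get_true_extent(reduction_packet_type, &true_lb, &true_extent);
if (true_extent != (MPI_Aint) sizeof(REDUCE_PACKET)) {
  fprintf(stderr, "type extent %ld != sizeof(REDUCE_PACKET) %zu\n",
          (long) true_extent, sizeof(REDUCE_PACKET));
  MPI_Abort(MPI_COMM_WORLD, 1);
}

With the stale definition above, this reports an extent of 72 against a struct size of 64.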