I have a program that runs correctly when compiled with Open MPI, but fails with an error in MPI_Allreduce() when compiled with MPICH 3.2.1. This happens on both Linux and macOS.
The relevant code is:
```c
typedef struct reduction_packet {
    double sig;        /* logev */
    double s_width;    /* starting width */
    double s_nsites;   /* starting nsites */
    double width;      // width of motif
    double nsites_dis; // final number of sites
    double llr;        // LLR of motif
    double classic;    // true if Classic objective function
    double ID;         /* Use a double so the MPI type handle is simple. */
} REDUCE_PACKET;
REDUCE_PACKET a_packet, best_packet;
...
MPI_Allreduce((void *)&a_packet, (void *)&best_packet, 1,
              reduction_packet_type, max_packets_op, MPI_COMM_WORLD);
```
The root error on the MPICH error stack is:
```
MPIR_Localcopy(100)......: memcpy arguments alias each other, dst=0x7ffeeadd2f80 src=0x7ffeeadd2fc0 len=72
```
My interpretation is that MPICH is telling me that a_packet and best_packet overlap: the length to be copied is 72 bytes, but the two variables are only 64 bytes apart.
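The numbers in the error message can be checked by hand: 0x7ffeeadd2fc0 - 0x7ffeeadd2f80 = 0x40 = 64 bytes, and 72 bytes is exactly nine doubles, one more than the struct holds. The helper below is a hypothetical sketch of the kind of overlap test that produces such a message (it is not MPICH's actual source): two equal-length ranges overlap when their start addresses are fewer than len bytes apart.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not MPICH's code: returns 1 if the ranges
   [dst, dst+len) and [src, src+len) overlap, i.e. if
   memcpy(dst, src, len) would have undefined behavior. */
int buffers_alias(uintptr_t dst, uintptr_t src, size_t len) {
    uintptr_t gap = (dst > src) ? dst - src : src - dst;
    return gap < len;
}
```

With the addresses from the error, buffers_alias(0x7ffeeadd2f80, 0x7ffeeadd2fc0, 72) returns 1, while the same call with a length of 64 (the actual size of the struct) returns 0.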
Each of these buffers is a struct of 8 doubles, which accounts for 64 bytes. I could imagine some padding for alignment, but the compiler is evidently happy to allocate the two variables on the stack with no padding between them. I've logged the addresses of a_packet and best_packet, and they match the addresses reported in the error message from MPIR_Localcopy().
If I change the declarations of the two variables to:
```c
REDUCE_PACKET a_packet;
char foo[2];
REDUCE_PACKET best_packet;
```
the program runs without error on both MPICH and Open MPI.
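The char foo[2] workaround only changes where the compiler places best_packet. If, as len=72 in the error suggests, the datatype really spans 72 bytes, a copy of that length still writes past the end of a 64-byte struct; the alias check simply no longer fires. The sketch below simulates this with a hypothetical byte arena (a 64-byte stand-in for best_packet followed by guard bytes) and counts how many bytes beyond the struct are overwritten. It assumes the 72-byte length from the error message and does not reproduce MPICH's actual copy path.

```c
#include <stddef.h>
#include <string.h>

#define PACKET_BYTES 64  /* sizeof(REDUCE_PACKET): eight doubles */
#define GUARD_BYTES  16  /* hypothetical guard region after the struct */

/* Copies copy_len bytes of 0xAB into a zeroed arena made of a 64-byte
   "packet" plus a guard region, and returns how many guard bytes were
   overwritten. copy_len is clamped to the arena size. */
size_t bytes_clobbered_past_packet(size_t copy_len) {
    unsigned char arena[PACKET_BYTES + GUARD_BYTES];
    unsigned char src[PACKET_BYTES + GUARD_BYTES];
    size_t i, clobbered = 0;

    if (copy_len > sizeof arena)
        copy_len = sizeof arena;       /* stay inside the simulation */
    memset(arena, 0, sizeof arena);
    memset(src, 0xAB, sizeof src);
    memcpy(arena, src, copy_len);      /* the datatype-sized copy */

    for (i = PACKET_BYTES; i < sizeof arena; i++)
        if (arena[i] != 0)
            clobbered++;
    return clobbered;
}
```

Here bytes_clobbered_past_packet(72) returns 8 and bytes_clobbered_past_packet(64) returns 0, so under that assumption the extra padding hides the error rather than fixing it.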
Why does MPICH think this variable requires 72 bytes rather than 64? Am I missing something in the MPI/MPICH documentation that says I'm responsible for this sort of manual padding?
MPI_Type_get_true_extent(reduction_packet_type, ...) and compare the output between Open MPI and MPICH – Gilles Gouaillardet