I'm trying to implement an all-to-all operation (i.e. MPI_Alltoall) on a hypercube network using C++.
For example, for n = 8 (where n is the number of processors), I store the initial data as
p0: [00, 01, 02, ..., 07],
p1: [10, 11, 12, ..., 17],
...
...
p7: [70, 71, 72, ..., 77].
After running the all-to-all, the data should become
p0: [00, 10, 20, ..., 70],
p1: [01, 11, 21, ..., 71],
...,
p7: [07, 17, 27, ..., 77].
(In other words, every processor grabs data from everyone else).
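To pin down the target, the result is the transpose of the initial layout: slot j on processor r should end up holding what was originally slot r on processor j. Below is a small sketch of that check, assuming the values are encoded as 10*r + j as in the example above. The function names `initial_layout` and `expected_layout` are made up for illustration.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Initial layout: processor r holds the values 10*r + j for j = 0..p-1.
Grid initial_layout(int p) {
    assert(p > 0);
    Grid data(p, std::vector<int>(p));
    for (int r = 0; r < p; ++r)
        for (int j = 0; j < p; ++j)
            data[r][j] = 10 * r + j;
    return data;
}

// Target layout: slot j of processor r holds slot r of processor j's
// initial array, i.e. the transpose of the initial layout.
Grid expected_layout(int p) {
    const Grid init = initial_layout(p);
    Grid out(p, std::vector<int>(p));
    for (int r = 0; r < p; ++r)
        for (int j = 0; j < p; ++j)
            out[r][j] = init[j][r];
    return out;
}
```

Any candidate implementation can then be compared against `expected_layout(8)`.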
I thought of an algorithm that loops over a bit mask, where each step swaps data between two processors, e.g. swapping the last 4 elements of p0 with the first 4 elements of p4, its neighbor across the highest hypercube dimension (p0 sends its last 4 elements to p4 while p4 sends its first 4 elements to p0). I can't see how to do this with plain MPI_Send and MPI_Recv, because the receiver's half of the array would be overwritten before it has sent out its own data. Could anyone suggest a technique for this? I thought about using an intermediate buffer, but I'm still not sure how to write the MPI send and receive code.
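A minimal sketch of the mask loop with an intermediate buffer, simulated in a single process without MPI so that it can be run and checked on its own. It assumes the number of processors is a power of two and one int per slot. The names `pack`, `unpack` and `hypercube_alltoall` are hypothetical. In an actual MPI program each rank would run only its own pack and unpack, and the exchange in between would presumably be a single MPI_Sendrecv with partner `rank ^ mask`, using separate send and receive buffers so nothing is overwritten before it is sent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::vector<int>;

// Copy out the slots this rank gives away in the step for `mask`:
// the slots whose index differs from `rank` in the mask bit.
Row pack(const Row& local, int rank, int mask) {
    Row out;
    for (int j = 0; j < static_cast<int>(local.size()); ++j)
        if ((j & mask) != (rank & mask)) out.push_back(local[j]);
    return out;
}

// Write the partner's packed values into the same slots that were packed.
// The k-th value received lands in the k-th slot this rank gave away.
void unpack(Row& local, int rank, int mask, const Row& in) {
    std::size_t k = 0;
    for (int j = 0; j < static_cast<int>(local.size()); ++j)
        if ((j & mask) != (rank & mask)) local[j] = in[k++];
}

// Simulates every rank in one process; data[r] is rank r's local array.
void hypercube_alltoall(std::vector<Row>& data) {
    const int p = static_cast<int>(data.size());
    assert(p > 0 && (p & (p - 1)) == 0);  // a hypercube needs a power of two
    for (const Row& row : data) assert(static_cast<int>(row.size()) == p);

    for (int mask = p / 2; mask >= 1; mask >>= 1) {
        // Pack every send buffer first, so no rank's data is overwritten
        // before it has been copied out.
        std::vector<Row> sendbuf(p);
        for (int r = 0; r < p; ++r) sendbuf[r] = pack(data[r], r, mask);

        // Stand-in for the exchange: rank r receives what its partner
        // r ^ mask packed, and stores it in the slots it just gave away.
        for (int r = 0; r < p; ++r) unpack(data[r], r, mask, sendbuf[r ^ mask]);
    }
}
```

With 8 rows filled as `data[r][j] = 10 * r + j`, the first step (mask 4) swaps the last 4 slots of row 0 with the first 4 slots of row 4, and after all three steps `data[r][j]` should equal `10 * j + r`. For the smaller masks the exchanged slots are not contiguous, which is why the sketch packs them into a buffer before exchanging.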
Alternatively, if someone can suggest any other way to implement all-to-all, I would really appreciate it. Thank you very much!