I have an array of the same length on every rank (let's assume 10). On each rank, some entries hold that rank's number and the rest are zero. For example:
Proc 1: [1 0 0 0 0 1 0 0 0 1]
Proc 2: [0 2 2 0 0 0 0 2 2 0]
Proc 3: [0 0 0 3 3 0 3 0 0 0]
What is the most efficient way (using MPI-2) for all processors to end up with the following array?
[1 2 2 3 3 1 3 2 2 1]
This result can be thought of as the element-wise sum of all the arrays (distributed across the ranks). Performance is important, as I want to do this quickly on 1K+ cores.
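To make the intended result concrete, here is a minimal sketch in plain Python with no MPI involved. The names `local_arrays` and `elementwise_sum` are made up for illustration, and the three lists simply stand in for the per-rank arrays from the example above.

```python
# Hypothetical stand-in for the per-rank arrays; no MPI is used here.
local_arrays = [
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1],  # Proc 1
    [0, 2, 2, 0, 0, 0, 0, 2, 2, 0],  # Proc 2
    [0, 0, 0, 3, 3, 0, 3, 0, 0, 0],  # Proc 3
]


def elementwise_sum(arrays):
    """Add the arrays position by position and return the combined list."""
    return [sum(column) for column in zip(*arrays)]


# Every rank should end up holding this same combined array.
combined = elementwise_sum(local_arrays)
print(combined)
```

In MPI terms I assume this corresponds to an all-reduce with a sum operation (`MPI_Allreduce` with `MPI_SUM`), but whether that is the fastest option at this scale is exactly what I am asking.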