You need to make sure the number of Barrier calls is the same for each process. In your particular case, when n=3 you have two Barrier calls for rank 0 and rank 1 but only one for rank 2. Ranks 0 and 1 will therefore block in their second Barrier, waiting for a matching call from rank 2 that never comes.
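As a quick check of that rule, here is a minimal sketch in plain C++ (no MPI needed) that counts the Barrier calls each rank would make. It assumes the control flow described in this answer: one Barrier guarded by rank == 0, one guarded by rank == 1, and one unconditional Barrier. The names barrier_calls and barriers_balanced are made up for illustration.

```cpp
// Hypothetical helper: number of MPI_Barrier calls made by `rank`,
// assuming the control flow of the program discussed in this answer.
int barrier_calls(int rank) {
    int calls = 0;
    if (rank == 0) ++calls;  // barrier 1, only rank 0
    if (rank == 1) ++calls;  // barrier 2, only rank 1
    ++calls;                 // barrier 3, every rank
    return calls;
}

// Hypothetical helper: true when all n ranks make the same number of
// Barrier calls, i.e. every Barrier can be matched by every process.
bool barriers_balanced(int n) {
    for (int rank = 1; rank < n; ++rank) {
        if (barrier_calls(rank) != barrier_calls(0)) return false;
    }
    return true;
}
```

With these definitions barriers_balanced(2) is true and barriers_balanced(3) is false, because rank 2 only makes one call.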
Here is what should be happening for n=3:
together:
rank 0 will reach barrier 1 then block
rank 1 will print "some output", reach barrier 2 then block
rank 2 will print "some output", reach barrier 3 then block
together:
rank 0 will print "some output", reach barrier 3 then block
rank 1 will reach barrier 3 then block
rank 2 will print "end" then hit finalize
Having one process in MPI_Finalize while the others are still blocked in a Barrier is going to be undefined behaviour. In practice the job hangs, as the n=3 run further down shows.
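The same walk-through can be mimicked without MPI. The sketch below is only a model, not how an MPI library works internally: each rank is a list of operations, every rank runs until it blocks in a Barrier or finishes, and a Barrier is released only when all ranks are waiting in one. Op, program_for, Trace and simulate are names invented for this illustration.

```cpp
#include <cstddef>
#include <vector>

enum class Op { Print, Barrier };

// Operations executed by one rank, in program order (assumed control flow).
std::vector<Op> program_for(int rank) {
    std::vector<Op> ops;
    if (rank == 0) ops.push_back(Op::Barrier);  // barrier 1
    ops.push_back(Op::Print);                   // "some output"
    if (rank == 1) ops.push_back(Op::Barrier);  // barrier 2
    ops.push_back(Op::Barrier);                 // barrier 3
    ops.push_back(Op::Print);                   // "end"
    return ops;
}

struct Trace {
    int phases;      // number of "together" groups in the walk-through
    bool completed;  // true if every rank reached the end of its program
};

Trace simulate(int n) {
    std::vector<std::vector<Op>> progs;
    for (int r = 0; r < n; ++r) progs.push_back(program_for(r));
    std::vector<std::size_t> pc(n, 0);  // index of the next operation per rank
    int phases = 0;
    while (true) {
        ++phases;
        int finished = 0;
        for (int r = 0; r < n; ++r) {
            // Each rank runs freely until it blocks in a Barrier or ends.
            while (pc[r] < progs[r].size() && progs[r][pc[r]] != Op::Barrier) ++pc[r];
            if (pc[r] == progs[r].size()) ++finished;
        }
        if (finished == n) return {phases, true};
        // Some rank has finished while others wait: that Barrier never completes.
        if (finished > 0) return {phases, false};
        for (int r = 0; r < n; ++r) ++pc[r];  // all ranks waiting: release them
    }
}
```

In this model simulate(3) stops in the second phase without completing, while simulate(2) completes after three phases.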
Doing the same analysis for n=2:
together:
rank 0 will reach barrier 1 then block
rank 1 will print "some output", reach barrier 2 then block
together:
rank 0 will print "some output", reach barrier 3 then block
rank 1 will reach barrier 3 then block
together:
rank 0 will print "end" then hit finalize
rank 1 will print "end" then hit finalize
This suggests the output should be:
some output
some output
end
end
however you are getting:
some output
end
some output
end
This has to do with how the MPI infrastructure buffers and forwards stdout from the various ranks. We can see the behaviour better if we label and flush every line and introduce a delay, so that MPI has time to gather the output from each rank:
#include <cstdint>
#include <unistd.h>   // usleep
#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Barrier 1: only rank 0 calls it.
    if (rank == 0) {
        cout << rank << " Barrier 1\n" << flush;
        MPI_Barrier(MPI_COMM_WORLD);
    }
    cout << rank << " Some output \n" << flush;
    usleep(1000000);  // pause for 1 s so the output gets gathered

    // Barrier 2: only rank 1 calls it.
    if (rank == 1) {
        cout << rank << " Barrier 2\n" << flush;
        MPI_Barrier(MPI_COMM_WORLD);
    }

    // Barrier 3: every rank calls it.
    cout << rank << " Barrier 3\n" << flush;
    MPI_Barrier(MPI_COMM_WORLD);
    cout << rank << " end\n" << flush;
    usleep(1000000);
    MPI_Finalize();
    return 0;
}
which produces:
$ mpiexec -n 2 ./a.out
0 Barrier 1
1 Some output
0 Some output
1 Barrier 2
1 Barrier 3
0 Barrier 3
0 end
1 end
$ mpiexec -n 3 ./a.out
2 Some output
0 Barrier 1
1 Some output
0 Some output
1 Barrier 2
1 Barrier 3
2 Barrier 3
2 end
0 Barrier 3
^Cmpiexec: killing job...
Alternatively, look at the time stamps from the following C++11 code:
#include <cstdint>
#include <chrono>
#include <mpi.h>
#include <iostream>
using namespace std;

// Raw tick count of high_resolution_clock since its epoch.
// The tick period is implementation-defined.
inline unsigned long int time(void) {
    return std::chrono::high_resolution_clock::now().time_since_epoch().count();
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Barrier(MPI_COMM_WORLD);  // barrier 1, only rank 0
    }
    cout << rank << " " << time() << " Some output\n";
    if (rank == 1) {
        MPI_Barrier(MPI_COMM_WORLD);  // barrier 2, only rank 1
    }
    MPI_Barrier(MPI_COMM_WORLD);      // barrier 3, every rank
    cout << rank << " " << time() << " end\n";
    MPI_Finalize();
    return 0;
}
output:
$ mpiexec -n 2 ./a.out
0 1464100768220965374 Some output
0 1464100768221002105 end
1 1464100768220902046 Some output
1 1464100768221000693 end
sorted by timestamp:
$ mpiexec -n 2 ./a.out
1 1464100768220902046 Some output
0 1464100768220965374 Some output
1 1464100768221000693 end
0 1464100768221002105 end
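For reference, the reordering above can also be done in code. This is a minimal sketch that assumes every line has the form "rank timestamp message", as printed by the program above; sort_by_timestamp is a hypothetical helper name, and a plain numeric sort on the second column in a shell would do the same job.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: sort lines of the form "<rank> <timestamp> <message>"
// by their timestamp field, keeping the input order for equal timestamps.
std::vector<std::string> sort_by_timestamp(std::vector<std::string> lines) {
    // Extract the second whitespace-separated field as an integer.
    auto stamp = [](const std::string& line) {
        std::istringstream in(line);
        int rank = 0;
        unsigned long long ts = 0;
        in >> rank >> ts;
        return ts;
    };
    std::stable_sort(lines.begin(), lines.end(),
                     [&](const std::string& a, const std::string& b) {
                         return stamp(a) < stamp(b);
                     });
    return lines;
}
```

Feeding it the four lines of the n=2 run gives the order shown in the sorted listing: both "Some output" lines come before both "end" lines.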
The conclusion is that Barrier is behaving as expected, and that the order in which print statements show up on the terminal is not necessarily going to tell you that.
Edit: 2016-05-24 to show detailed analysis of program behaviour.