
I want to create a new communicator that holds only the ranks used in the processing. If I have 24 processors available and only need 10, the group should hold only those 10; otherwise it should hold all of them. Creating the communicator appears to succeed, but as soon as I call something like MPI_Comm_size or MPI_Comm_rank on the new communicator, MPI aborts with an error.

    float **matrix;
    int *ranksArr;
    MPI_Comm default_comm;
    MPI_Group world_grp, new_grp;
    MPI_Comm_rank(MPI_COMM_WORLD, &proc_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &proc_avail);
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    compute_block_size(&block, proc_avail);

    if(block.procsUsed == proc_avail)
    {
        ranksArr = alloc_ranks_arr(proc_avail);
    }
    else
    {
        ranksArr = alloc_ranks_arr(block.procsUsed);
        proc_avail = block.procsUsed;
    }

    MPI_Group_incl(world_grp, proc_avail, ranksArr, &new_grp);
    MPI_Comm_create(MPI_COMM_WORLD, new_grp, &default_comm);
    //MPI_Comm_size(default_comm, &proc_avail); //ERROR, default_comm

    MPI_Comm_rank(default_comm, &proc_rank);

    matrix = create_matrix_sub(&block, proc_rank);


    dealloc_matrix(matrix);

    int* alloc_ranks_arr(int totalRanks)
    {
        int *ranksToGroup = malloc(totalRanks * sizeof(int));
        int i;

        for(i = 0; i < totalRanks ; i++)
        {
            ranksToGroup[i] = i;
        }

        return ranksToGroup;
    }

    [cluster-srv2:24701] * An error occurred in MPI_Comm_rank
    [cluster-srv2:24701] * on communicator MPI_COMM_WORLD
    [cluster-srv2:24701] * MPI_ERR_COMM: invalid communicator
    [cluster-srv2:24701] * MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

The docs say:

MPI_ERR_COMM Invalid communicator. A common error is to use a null communicator in a call (not even allowed in MPI_Comm_rank).

But I create the communicator right before calling MPI_Comm_rank, and MPI_Comm_create returns MPI_SUCCESS, so I have no idea why this is happening.

OK, after looking at different samples I think I understand the problem. The other 14 processes never get included in the group, since I only generate ranks 0-9 for it. I suspect this invalidates the newly created communicator when it is used by ranks that aren't part of the group. Since I don't want to use those processes, I'm thinking of calling MPI_Finalize on every rank I won't be using. I am not sure this approach is acceptable, but it might have to do for now. - Patrick.SE

1 Answer


The documentation for MPI_Comm_create says:

In the case that a process calls with a group to which it does not belong, e.g., MPI_GROUP_EMPTY, then MPI_COMM_NULL is returned as newcomm.

So even though the MPI_Comm_create() call returns MPI_SUCCESS, the processes outside the group (world ranks 10-23 in your 24-process example) receive MPI_COMM_NULL in default_comm, which is illegal to pass to calls such as MPI_Comm_rank or MPI_Comm_size.

After the call to MPI_Comm_create, you should branch according to whether the process is in the new communicator or not, ideally by checking if default_comm == MPI_COMM_NULL.