1 vote

I have two questions regarding the use of MPI shared-memory communication.

1) If I have an MPI rank which is the only one that writes to a window, is it necessary to employ MPI_Win_lock and MPI_Win_unlock? I know that my application would never have other ranks trying to write to that window; they only read the contents of the window, and I make sure that they read after an MPI_Barrier, so the contents of the window have been updated.

2) In my application I have one MPI rank which allocates a shared window that needs to be read by the other MPI ranks 1 to N:

MPI rank 1 shall only read rma(1:10)

MPI rank 2 shall only read rma(11:20)

MPI rank N shall only read rma(10*(N-1)+1:10*N)

Currently, all ranks 1 to N query the whole shared window, i.e. the full size 10*N, with MPI_WIN_SHARED_QUERY.

I am asking whether it is possible to apply MPI_WIN_SHARED_QUERY such that MPI rank 1 can only access the window elements 1:10, rank 2 only 11:20, and so on.

In this way, each rank would index its portion locally as 1:10, but the portions would refer to different chunks of the shared window. Is this possible?

Thanks very much!

UPDATE

The answer below seems to do what I want, but it does not work when using MPI_WIN_SHARED_QUERY.

However, I don't understand how the local pointers automatically point to different sections of the array. How does it know to do that? The only thing being done is the c_f_pointer call with the size nlocal = 5. How does it know that, e.g., rma for rank 3 must access the 5 elements 16-20? It is really not clear to me, and I am concerned about whether it is portable, i.e. can I rely on it?


1 Answer

2 votes

First, I would recommend using MPI_Win_fence for synchronisation rather than MPI_Barrier - this ensures synchronisation in time like a barrier, but also ensures that all operations on the window are visible (e.g. writes are flushed to memory).

If you use MPI_Win_allocate_shared() then you automatically get what you want - each rank receives a pointer to the start of its own local section. However, the memory is contiguous, so you can access all of it by over/under-indexing the array elements (you could instead use normal Fortran pointers to point at subsections of an array allocated purely by rank 0, but I think MPI_Win_allocate_shared() is more elegant).

Here is some code that illustrates the point - a shared array is created, initialised by rank 0 but read by all ranks.

This seems to work OK on my laptop:

me@laptop:~$ mpirun -n 4 ./rmatest
 Running on            4  processes with n =           20
 Rank            2  in COMM_WORLD is rank            2  in nodecomm on node laptop
 Rank            3  in COMM_WORLD is rank            3  in nodecomm on node laptop
 Rank            0  in COMM_WORLD is rank            0  in nodecomm on node laptop
 Rank            1  in COMM_WORLD is rank            1  in nodecomm on node laptop
 rank, noderank, arr:            0           0           1           2           3           4           5
 rank, noderank, arr:            3           3          16          17          18          19          20
 rank, noderank, arr:            2           2          11          12          13          14          15
 rank, noderank, arr:            1           1           6           7           8           9          10

although in general this will only work across all the ranks in the same shared-memory node.

program rmatest

  use iso_c_binding, only: c_ptr, c_f_pointer

  use mpi

  implicit none

! Set the size of each rank's section of the shared array

  integer, parameter :: nlocal = 5
  integer :: i, n

  integer, pointer, dimension(:) :: rma

  integer :: comm, nodecomm, nodewin
  integer :: ierr, size, rank, nodesize, noderank, nodestringlen
  integer(MPI_ADDRESS_KIND) :: winsize
  integer :: intsize, disp_unit
  character(len=MPI_MAX_PROCESSOR_NAME) :: nodename
  type(c_ptr) :: baseptr

  comm = MPI_COMM_WORLD

  call MPI_Init(ierr)

  call MPI_Comm_size(comm, size, ierr)
  call MPI_Comm_rank(comm, rank, ierr)

  ! Create node-local communicator

  call MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, &
                           MPI_INFO_NULL, nodecomm, ierr)

  ! Check it all went as expected

  call MPI_Get_processor_name(nodename, nodestringlen, ierr)
  call MPI_Comm_size(nodecomm, nodesize, ierr)
  call MPI_Comm_rank(nodecomm, noderank, ierr)

  n = nlocal*nodesize

  if (rank == 0) then

     write(*,*) "Running on ", size, " processes with n = ", n

  end if

  write(*,*) "Rank ", rank," in COMM_WORLD is rank ", noderank, &
             " in nodecomm on node ", nodename(1:nodestringlen)

  call MPI_Type_size(MPI_INTEGER, intsize, ierr)

  winsize = nlocal*intsize

  ! displacements counted in units of integers

  disp_unit = intsize

  call MPI_Win_allocate_shared(winsize, disp_unit, &
       MPI_INFO_NULL, nodecomm, baseptr, nodewin, ierr)
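  ! the returned baseptr points to the start of THIS rank's own
  ! nlocal-element section; by default the sections of the node ranks
  ! are laid out contiguously, one after another, in rank order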

  ! coerce baseptr to a Fortran array: global on rank 0, local on others

  if (noderank == 0) then

     call c_f_pointer(baseptr, rma, [n])

  else

     call c_f_pointer(baseptr, rma, [nlocal])

  end if

  ! Each rank zeroes its own nlocal-element section of the shared array

  rma(1:nlocal) = 0

  ! Set values on noderank 0

  call MPI_Win_fence(0, nodewin, ierr)
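  ! the fence above ensures every rank has finished zeroing its own
  ! section before noderank 0 overwrites the whole array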

  if (rank == 0) then
     do i = 1, n
        rma(i) = i
     end do
  end if

  call MPI_Win_fence(0, nodewin, ierr)
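  ! this second fence guarantees that noderank 0's writes are visible
  ! to all node ranks before they read their own sections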

  ! Print the values  

  write(*,*) "rank, noderank, arr: ", rank, noderank, (rma(i), i=1,nlocal)

  call MPI_Finalize(ierr)

end program rmatest
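
Regarding the update about MPI_WIN_SHARED_QUERY: every rank can also map the whole array by querying the base address of noderank 0's section with MPI_Win_shared_query, because the default allocation is contiguous across the ranks on the node (i.e. the alloc_shared_noncontig info key is not set). Below is a minimal sketch of that variant; the program name rmaquery and the names qsize, qdisp, qptr and rmaall are just illustrative, and the rest mirrors the program above.

program rmaquery

  use iso_c_binding, only: c_ptr, c_f_pointer

  use mpi

  implicit none

  ! Size of each rank's section of the shared array

  integer, parameter :: nlocal = 5
  integer :: i, n

  ! Pointer to the whole shared array, valid on every node rank

  integer, pointer, dimension(:) :: rmaall

  integer :: comm, nodecomm, nodewin
  integer :: ierr, size, rank, nodesize, noderank
  integer(MPI_ADDRESS_KIND) :: winsize, qsize
  integer :: intsize, disp_unit, qdisp
  type(c_ptr) :: baseptr, qptr

  comm = MPI_COMM_WORLD

  call MPI_Init(ierr)

  call MPI_Comm_size(comm, size, ierr)
  call MPI_Comm_rank(comm, rank, ierr)

  ! Node-local communicator, as in the program above

  call MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, &
                           MPI_INFO_NULL, nodecomm, ierr)

  call MPI_Comm_size(nodecomm, nodesize, ierr)
  call MPI_Comm_rank(nodecomm, noderank, ierr)

  n = nlocal*nodesize

  call MPI_Type_size(MPI_INTEGER, intsize, ierr)

  winsize   = nlocal*intsize
  disp_unit = intsize

  ! Each rank contributes nlocal integers to the shared window

  call MPI_Win_allocate_shared(winsize, disp_unit, &
       MPI_INFO_NULL, nodecomm, baseptr, nodewin, ierr)

  ! Query the base address of noderank 0's section; with the default
  ! contiguous allocation, all n elements can be mapped through it

  call MPI_Win_shared_query(nodewin, 0, qsize, qdisp, qptr, ierr)
  call c_f_pointer(qptr, rmaall, [n])

  call MPI_Win_fence(0, nodewin, ierr)

  if (noderank == 0) then
     do i = 1, n
        rmaall(i) = i
     end do
  end if

  call MPI_Win_fence(0, nodewin, ierr)

  ! Each rank reads only its own chunk, addressed through the global
  ! view, e.g. noderank 2 reads elements 11-15 when nlocal = 5

  write(*,*) "rank, noderank, chunk: ", rank, noderank, &
       (rmaall(noderank*nlocal + i), i = 1, nlocal)

  call MPI_Win_free(nodewin, ierr)

  call MPI_Finalize(ierr)

end program rmaquery

With this global view, node rank r reads elements r*nlocal+1 to (r+1)*nlocal, which is the chunked access pattern asked about in the question.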