
This is my first question on StackOverflow :-) so apologies if I have posted it the wrong way.

This is my problem: I have to compare the recursive Fibonacci algorithm across different parallel programming models: Cilk, OpenMP, and Open MPI.

Cilk and OpenMP were trivial, but Open MPI is a bit more complicated for me.

I found an implementation of recursive Fibonacci that uses MPI_Comm_spawn, and it works, but the MPI_Comm_spawn primitive creates and executes the new processes on the master node only, so the rest of the cluster goes unused.

So my question is: is there a way to execute the spawned processes across the entire cluster? Otherwise, are there other ways to implement recursive Fibonacci with Open MPI?

Thank you for helping me! :-)

This is the code that currently runs on the master node only:

[MASTER]

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv){
  long n, fibn;
  int world_size, flag;
  int *universe_sizep;   // MPI_UNIVERSE_SIZE is returned as a pointer to int
  int myrank;
  char command[] = "slave_fib";
  MPI_Comm children_comm;
  int errcodes[1];

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank (MPI_COMM_WORLD, &myrank);

  MPI_Info local_info;
  MPI_Info_create (&local_info);

  if (world_size != 1)
    fprintf (stderr, "Top heavy with management\n");

  MPI_Comm_get_attr (MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &universe_sizep, &flag);
  if (flag && *universe_sizep == 1)
    fprintf (stderr, "No room to start workers\n");

  if (argc < 2){
      fprintf (stderr, "usage: %s fib_num\n", argv[0]);
      MPI_Finalize ();
      return 1;
  }

  // Prepare argv for spawning the recursive process
  argv += 1;
  n = atol (argv[0]);

  if (n < 2){
      printf ("fib(%ld)=%ld\n", n, n);
  }else{
      sprintf (argv[0], "%ld", n);
      // The root argument is a rank in MPI_COMM_SELF; myrank is 0 with -np 1
      MPI_Comm_spawn (command, argv, 1, local_info, myrank, MPI_COMM_SELF,
                      &children_comm, errcodes);
      // Wait for the result computed by the spawned process tree
      MPI_Recv (&fibn, 1, MPI_LONG, MPI_ANY_SOURCE, 1, children_comm,
                MPI_STATUS_IGNORE);

      printf ("fib(%ld)=%ld\n", n, fibn);
  }
  fflush(stdout);

  MPI_Info_free (&local_info);
  MPI_Finalize ();
  return 0;
}


##### SPAWNED BINARY #####

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv){
  long n, fibn, x, y;
  int myrank, size;
  char command[] = "slave_fib";
  MPI_Comm children_comm[2];
  MPI_Comm parent;
  MPI_Info local_info;
  int errcodes[1];

  MPI_Init (&argc, &argv);
  MPI_Comm_get_parent (&parent);
  MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
  MPI_Info_create (&local_info);

  // This binary only makes sense when started through MPI_Comm_spawn
  if (parent == MPI_COMM_NULL){
      fprintf (stderr, "No parent!\n");
      MPI_Finalize ();
      return 1;
  }

  MPI_Comm_remote_size (parent, &size);
  if (size != 1)
    fprintf (stderr, "Something's wrong with the parent\n");

  argv += 1;
  n = atol (argv[0]);
  if (n < 2){

      // Base case: fib(0) = 0 and fib(1) = 1
      MPI_Send (&n, 1, MPI_LONG, 0, 1, parent);

  }else{

      // Spawn one child for fib(n - 1)
      sprintf (argv[0], "%ld", (n - 1));

      MPI_Comm_spawn (command, argv, 1, local_info, myrank,
                      MPI_COMM_SELF, &children_comm[0], errcodes);

      // Spawn one child for fib(n - 2)
      sprintf (argv[0], "%ld", (n - 2));

      MPI_Comm_spawn (command, argv, 1, local_info, myrank,
                      MPI_COMM_SELF, &children_comm[1], errcodes);

      // Collect both partial results before adding them
      MPI_Recv (&x, 1, MPI_LONG, MPI_ANY_SOURCE, 1,
                children_comm[0], MPI_STATUS_IGNORE);

      MPI_Recv (&y, 1, MPI_LONG, MPI_ANY_SOURCE, 1,
                children_comm[1], MPI_STATUS_IGNORE);

      fibn = x + y;             // computation

      MPI_Send (&fibn, 1, MPI_LONG, 0, 1, parent);
    }

  MPI_Info_free (&local_info);
  MPI_Finalize ();
  return 0;
}

How to execute it: mpirun -np 1 binary_name fib_num

The only way to execute it is with -np 1; if you set np > 1, the execution returns an error (from MPI_Comm_spawn).
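
On the "other solutions" part of the question: one possible alternative to MPI_Comm_spawn is a static decomposition, where the top levels of the recursion tree are expanded once and the resulting subproblems are shared among a fixed set of ranks. The sketch below only simulates the ranks with a plain loop so that it runs without MPI; the names fib_seq, frontier and fib_static are hypothetical, and the round-robin assignment is an assumption, not something taken from the code above. In a real MPI program each rank would execute only its own inner loop and the partial sums would be combined with MPI_Reduce.

```c
/* Plain sequential recursion, run locally by each (simulated) worker. */
static long fib_seq(int n)
{
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

/* Expand the top `depth` levels of the recursion tree and store the
   subproblem sizes found at the frontier in `out`.
   fib(n) equals the sum of fib() over these subproblems. */
static int frontier(int n, int depth, int *out, int count)
{
    if (depth == 0 || n < 2) {
        out[count++] = n;
        return count;
    }
    count = frontier(n - 1, depth - 1, out, count);
    return frontier(n - 2, depth - 1, out, count);
}

/* Simulate `nranks` workers: rank r takes subproblems r, r + nranks, ...
   (hypothetical round-robin assignment). */
static long fib_static(int n, int depth, int nranks)
{
    int tasks[1 << 10];          /* at most 2^depth subproblems */
    if (depth > 10)
        depth = 10;              /* keep the frontier inside tasks[] */
    int ntasks = frontier(n, depth, tasks, 0);

    long total = 0;
    for (int r = 0; r < nranks; r++) {
        long partial = 0;        /* what rank r would compute locally */
        for (int t = r; t < ntasks; t += nranks)
            partial += fib_seq(tasks[t]);
        total += partial;        /* stands in for MPI_Reduce with MPI_SUM */
    }
    return total;
}
```

For example, fib_static(20, 4, 4) splits fib(20) into at most 16 subproblems and shares them among 4 simulated ranks. The number of processes stays fixed, at the cost of a less even load than fully dynamic spawning.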

Are you running under a batch manager? Which one? - Gilles Gouaillardet
Nope. I'm trying to install and test Torque PBS right now. Do you think Torque will solve the problem, or should I use another batch manager? - colbacc8
Any batch manager should help. Meanwhile, you can mpirun --host host1:n1,host2:n2,... -np 1 ... to use more than one node - Gilles Gouaillardet
I tried on a fresh installation of Open MPI, without Torque or any other batch manager. Now the processes are spawned across the nodes, but I received this error: [[6022,0],0] ERROR: message to [[2048,34327],0] requires routing and the OOB has no knowledge of this process. I'm investigating it; maybe I have to add more info to the MPI environment - colbacc8

1 Answer

After a fresh installation of Ubuntu 16.04 and libopenmpi-dev 1.10.2 on a cluster of 4 nodes, the Fibonacci computation seems to work and the spawned processes are spread across all nodes (without Torque).

But when I try to compute a Fibonacci number greater than 10, I get errors: 1) sometimes the execution waits forever for a spawned process to end; 2) sometimes I receive this error:

 Child job 67 terminated normally, but 1 process returned a non-zero
 exit code..

Moreover, I receive a lot of these messages in each execution:

[[30037,42],0] dpm_base_disconnect_init: error -12 in isend to process 0

These messages appear both when the computation fails and when it ends successfully. Am I perhaps using MPI_Comm_spawn and send/recv in the wrong way?
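
One thing that may be worth checking for the failures above n = 10 is how many processes the recursion actually starts. A rough sketch, assuming one slave_fib process per recursive call as in the code in the question (spawned_processes is a hypothetical helper, not an MPI function):

```c
/* Number of slave_fib processes started for fib(n): one per call,
   and every call with n >= 2 spawns two more (assumption: the
   spawn pattern matches the question's code exactly). */
static long spawned_processes(int n)
{
    if (n < 2)
        return 1;
    return 1 + spawned_processes(n - 1) + spawned_processes(n - 2);
}
```

By this count, spawned_processes(10) is 177 and spawned_processes(11) is 287, so the process count grows about as fast as the Fibonacci numbers themselves. On 4 nodes that means heavy oversubscription; whether this is what causes the hangs and the non-zero exit codes is not confirmed here.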