2 votes

I have been experimenting for the last couple of days with MPI to write fault-tolerant applications in C. I am trying to learn how to attach an error handler to the MPI_COMM_WORLD communicator so that, if a node goes down (possibly due to a crash) and exits without calling MPI_Finalize(), the program can still recover from this situation and continue its computations.

The problem I am having so far is that after I attach the error handler function to the communicator and then cause a node to crash, MPI does not call the error handler but instead forces all processes to exit.

I thought it might be a problem with my application, so I looked for sample code online and tried running it, but the situation is the same. The sample code that I am currently trying to run is the following (I got it from http://www.shodor.org/media/content//petascale/materials/distributedMemory/presentations/MPI_Error_Example.pdf; apologies for it being a PDF, but I didn't write it, so I paste the same code below):

/* Template for creating a custom error handler for MPI and a simple program 
to demonstrate its use. How much additional information you can obtain 
is determined by the MPI binding in use at build/run time. 

To illustrate that the program works correctly use -np 2 through -np 4.

To illustrate an MPI error set victim_mpi = 5 and use -np 6.

To illustrate a system error set victim_os = 5 and use -np 6.

2004-10-10 charliep created
2006-07-15 joshh  updated for the MPI2 standard
2007-02-20 mccoyjo  adapted for folding@clusters
2010-05-26 charliep cleaned-up/annotated for the petascale workshop 
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "mpi.h"

void ccg_mpi_error_handler(MPI_Comm *, int *, ...);

int main(int argc, char *argv[]) {
    MPI_Status status;
    MPI_Errhandler errhandler;
    int number, rank, size, next, from;
    const int tag = 201;
    const int server = 0;
    const int victim_mpi = 5;
    const int victim_os = 6;

    MPI_Comm bogus_communicator;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm_create_errhandler(&ccg_mpi_error_handler, &errhandler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, errhandler);

    next = (rank + 1) % size;
    from = (rank + size - 1) % size;

    if (rank == server) {
        printf("Enter the number of times to go around the ring: ");
        fflush(stdout);
        scanf("%d", &number);                                              
        --number;
        printf("Process %d sending %d to %d\n", rank, number, next);
        MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }

    while (true) {
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);
        printf("Process %d received %d\n", rank, number);
        if (rank == server) {
            number--;
            printf("Process 0 decremented number\n");
        }

        if (rank == victim_os) {
            int a[10];
            printf("Process %d about to segfault\n", rank);
            a[15565656] = 56;
        }

        if (rank == victim_mpi) {
            printf("Process %d about to go south\n", rank);
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, bogus_communicator);
        } else {
            printf("Process %d sending %d to %d\n", rank, number, next);
            MPI_Send(&number, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        }

        if (number == 0) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    if (rank == server)
        MPI_Recv(&number, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}

void ccg_mpi_error_handler(MPI_Comm *communicator, int *error_code, ...) {
    char error_string[MPI_MAX_ERROR_STRING];
    int error_string_length;
    printf("ccg_mpi_error_handler: entry\n");
    printf("ccg_mpi_error_handler: error_code = %d\n", *error_code);
    MPI_Error_string(*error_code, error_string, &error_string_length);
    error_string[error_string_length] = '\0';
    printf("ccg_mpi_error_handler: error_string = %s\n", error_string);
    printf("ccg_mpi_error_handler: exit\n");
    exit(1);
}

The program implements a simple token ring, and if I run it with the parameters described in the comments I get something like this:

    >>>>>>mpirun -np 6 example.exe
    Enter the number of times to go around the ring: 6
    Process 1 received 5
    Process 1 sending 5 to 2
    Process 2 received 5
    Process 2 sending 5 to 3
    Process 3 received 5
    Process 3 sending 5 to 4
    Process 4 received 5
    Process 4 sending 5 to 5
    Process 5 received 5
    Process 5 about to go south
    Process 5 sending 5 to 0
    Process 0 sending 5 to 1
    [HP-ENVY-dv6-Notebook-PC:09480] *** Process received signal *** 
    [HP-ENVY-dv6-Notebook-PC:09480] Signal: Segmentation fault (11)
    [HP-ENVY-dv6-Notebook-PC:09480] Signal code: Address not mapped (1) 
    [HP-ENVY-dv6-Notebook-PC:09480] Failing at address: 0xf0b397
    [HP-ENVY-dv6-Notebook-PC:09480] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fc0ec688cb0]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 1] /usr/lib/libmpi.so.0(PMPI_Send+0x74) [0x7fc0ec8f3704]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 2] example.exe(main+0x23f) [0x400e63]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fc0ec2da76d]
    [HP-ENVY-dv6-Notebook-PC:09480] [ 4] example.exe() [0x400b69]
    [HP-ENVY-dv6-Notebook-PC:09480] *** End of error message *** 
    --------------------------------------------------------------------------
    mpirun noticed that process rank 5 with PID 9480 on node andres-HP-ENVY-dv6-Notebook-PC exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------

Clearly, in the output that I see, none of the printf() calls in ccg_mpi_error_handler() have been executed, so I assume the handler wasn't called at all. I am not sure if it's of any help, but I am running Ubuntu Linux 12.04 and I installed MPI using apt-get. The command I used to compile the program is the following:

mpicc err_example.c -o example.exe

Also, when I do mpicc -v I get the following:

  Using built-in specs.
  COLLECT_GCC=/usr/bin/gcc
  COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6/lto-wrapper
  Target: x86_64-linux-gnu
  Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.3-1ubuntu5' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i686 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
  Thread model: posix
  gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Help is greatly appreciated! Thanks...


2 Answers

4 votes

The MPI standard does not require that MPI implementations be able to handle errors gracefully at all. The following excerpt from §8.3 of MPI-3.0 says it all:

An MPI implementation cannot or may choose not to handle some errors that occur during MPI calls. These can include errors that generate exceptions or traps, such as floating point errors or access violations. The set of errors that are handled by MPI is implementation-dependent. Each such error generates an MPI exception.

The above text takes precedence over any text on error handling within this document. Specifically, text that states that errors will be handled should be read as may be handled.

(original formatting preserved, including usage of bold and italic fonts)

There are many reasons for that, but most of them come down to a trade-off between performance and reliability. Having error checks at various levels and handling error conditions gracefully incurs a not-so-tiny overhead and makes the library code base considerably more complex.

That said, not all MPI libraries are created equal. Some of them implement better fault tolerance than others. For example, the same code with Intel MPI 4.1:

...
Process 5 about to go south
Process 5 sending 5 to 0
ccg_mpi_error_handler: entry
ccg_mpi_error_handler: error_code = 403287557
ccg_mpi_error_handler: error_string = Invalid communicator, error stack:
MPI_Send(186): MPI_Send(buf=0x7fffa32a7308, count=1, MPI_INT, dest=0, tag=201, comm=0x0) failed
MPI_Send(87).: Invalid communicator
ccg_mpi_error_handler: exit

The format of the error message in your case suggests that you are using Open MPI. Fault tolerance in Open MPI is somewhat experimental (one of the OMPI developers, namely Jeff Squyres, visits Stack Overflow from time to time; he could give a more definitive answer) and has to be explicitly enabled at library build time with an option like --enable-ft=LAM.

By default MPICH also cannot handle such situations:

Process 5 about to go south
Process 5 sending 5 to 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Note that currently MPI does not guarantee that the program state remains consistent when an error is detected:

After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected. The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits. An MPI implementation is free to allow MPI to continue after an error but is not required to do so.

One of the reasons is that it becomes impossible to perform collective operations on such "broken" communicators, and many internal MPI mechanisms require collective information sharing between all ranks. A much better fault-tolerance mechanism called run-through stabilisation (RTS) was proposed for inclusion in MPI-3.0, but it didn't make it past the final vote. With RTS a new MPI call would be added that creates a healthy communicator from a broken one by collectively removing all failed processes; the remaining processes could then continue to operate within the new communicator.
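
For reference, the MPI_ERRORS_RETURN mode mentioned in the quoted paragraph needs no custom handler at all. Below is a minimal sketch of that style (not code from the question); the out-of-range destination rank is just a convenient way to provoke an error, and whether the call actually returns instead of aborting is, as quoted above, implementation-dependent:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size, rc, payload = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Ask MPI to return error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately send to an out-of-range rank to provoke an error. */
    rc = MPI_Send(&payload, 1, MPI_INT, size, 99, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "Rank %d: MPI_Send failed: %s\n", rank, msg);
        /* The state of MPI is undefined at this point, so the safe
           options are limited to local clean-up followed by MPI_Abort(). */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}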

Disclaimer: I do not work for Intel and do not endorse their products. It just happens that IMPI provides a better out-of-the-box implementation of user error handling than the default build configurations of Open MPI and MPICH. It might be possible to achieve comparable levels of fault tolerance in both open-source implementations by changing the build options, or proper FT might be coming in the future (e.g. there is a prototype implementation of RTS in Open MPI).

1 vote

While Hristo is correct in everything he mentioned, the picture isn't quite so bleak. It's true that by default, there is no fault tolerance in almost any MPI implementation. However, there are options to turn on experimental fault tolerance in both Open MPI and MPICH. Hristo mentioned the build flag for Open MPI. For MPICH, the option is a runtime flag for mpiexec. Use the following command:

mpiexec -n <num_procs> --disable-auto-cleanup <executable> <program args>

The --disable-auto-cleanup flag tells MPICH not to automatically kill all of the processes when one process fails. This allows you to trigger your custom MPI_Errhandler. To use this, of course, you need a sufficiently new version of MPICH. I think anything after MPICH 3.0 will work, but I don't remember exactly when that feature was added. Currently, MPICH is in the preview releases for 3.1, so you can try that if you're daring.
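
As a rough sketch of what such a handler could look like when the goal is to keep the survivors running (the peer_failed flag and the install_surviving_handler helper are my own illustration, not part of MPICH), the key difference from the handler in the question is that it must not call exit():

#include <stdio.h>
#include "mpi.h"

/* Set by the handler so the main loop can notice the failure and switch
   to recovery code instead of exiting. */
static int peer_failed = 0;

static void surviving_error_handler(MPI_Comm *comm, int *error_code, ...) {
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    (void)comm;
    MPI_Error_string(*error_code, msg, &len);
    fprintf(stderr, "MPI error caught, trying to continue: %s\n", msg);
    peer_failed = 1;   /* crucially, do NOT call exit() here */
}

/* Attach the handler to MPI_COMM_WORLD, exactly as in the question. */
static void install_surviving_handler(void) {
    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(surviving_error_handler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
}

After the handler fires, the main loop can check peer_failed and switch to whatever recovery logic the application provides.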

While FT didn't make it into MPI 3.0 per se (the proposal was called User Level Failure Mitigation, not Run Through Stabilization; that was an older FT proposal), there is hope for FT applications, even when using collectives. You can try the new communicator creation function MPI_COMM_CREATE_GROUP, which is collective only over the new group rather than the whole parent communicator, to create a new communicator after a failure. Obviously this will be a bit tricky and you'll need to make sure that you handle all of your ongoing operations carefully, but it is possible. Alternatively, you can avoid collectives and everything becomes much easier.
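
Here is a rough sketch of that communicator-rebuilding idea. Detecting which ranks have failed and agreeing on that list among the survivors is the hard part and is assumed to have happened already; rebuild_comm and failed_ranks are names of my own invention, not an MPI or MPICH API:

#include "mpi.h"

/* Build a replacement communicator that excludes the known-failed ranks.
   'failed_ranks' (of length 'num_failed') is assumed to have been agreed
   upon by the surviving processes through some application-level protocol. */
int rebuild_comm(MPI_Comm old_comm, const int failed_ranks[], int num_failed,
                 MPI_Comm *new_comm) {
    MPI_Group old_group, survivor_group;
    int rc;

    MPI_Comm_group(old_comm, &old_group);
    MPI_Group_excl(old_group, num_failed, failed_ranks, &survivor_group);

    /* MPI_Comm_create_group is collective only over survivor_group,
       so the failed processes do not have to participate. */
    rc = MPI_Comm_create_group(old_comm, survivor_group, /* tag */ 0, new_comm);

    MPI_Group_free(&survivor_group);
    MPI_Group_free(&old_group);
    return rc;
}

Since MPI_COMM_CREATE_GROUP only requires participation from the processes in the group you pass it, the failed processes never need to show up, which is exactly what makes it usable after a failure.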