Is this a POSIX-compliant implementation for handling signals such as SIGFPE, SIGSEGV, etc. in a multithreaded program?

Question

I'm developing a program that needs to handle crash signals. By crash signal, I mean signals "delivered as a consequence of a hardware exception" [1], such as SIGFPE and SIGSEGV. I haven't found a specific name for this signal category, so I'm using this one for clarity and brevity.

According to my research, catching these signals is a pain. A crash signal handler must not return, otherwise the behavior is undefined [2][3]. Having undefined behavior means an implementation may kill the process or re-raise the signal, leaving the program stuck in an infinite loop, which is not desirable.

On the other hand, there is little freedom inside signal handlers in general, especially in a multithreaded program: functions called within a signal handler must be both thread-safe and async-signal-safe [4]. For example, you may not call malloc() as it is not async-signal-safe, nor can you call other functions that depend on it. In particular, as I'm using C++, I cannot safely call GCC's abi::__cxa_demangle() to produce a decent stack trace because it uses malloc() internally. While I could use Chromium's symbolize library [5] for async-signal-safe and thread-safe C++ symbol name demangling, I could not use dladdr() for a more informative stack trace as it is not specified as async-signal-safe.
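To make the restriction concrete, here is a minimal sketch of reporting a signal number without malloc() or stdio. The names `format_decimal` and `report_signal` are hypothetical, made up for this illustration; the only library call is write(), which [4] lists as async-signal-safe. It is a sketch of the constraint, not a full crash reporter.

```cpp
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: writes the decimal digits of `value` into `buf`
// and returns the number of characters written. It calls no library
// functions and allocates nothing.
std::size_t format_decimal(int value, char* buf, std::size_t capacity)
{
    char tmp[12]; // enough for a 32-bit int plus sign
    std::size_t n = 0;
    unsigned int magnitude = value < 0 ? 0u - static_cast<unsigned int>(value)
                                       : static_cast<unsigned int>(value);
    do {
        tmp[n++] = static_cast<char>('0' + magnitude % 10);
        magnitude /= 10;
    } while (magnitude != 0);
    if (value < 0)
        tmp[n++] = '-';

    // `tmp` holds the digits in reverse order; copy them out front to back.
    std::size_t written = 0;
    while (n > 0 && written < capacity)
        buf[written++] = tmp[--n];
    return written;
}

// Hypothetical helper: emits "caught signal <N>\n" to `fd` using only
// a stack buffer and write().
void report_signal(int fd, int sig)
{
    const char prefix[] = "caught signal ";
    char line[40];
    std::size_t len = 0;
    for (std::size_t i = 0; i + 1 < sizeof prefix; ++i)
        line[len++] = prefix[i];
    len += format_decimal(sig, line + len, sizeof line - len - 1);
    line[len++] = '\n';
    ssize_t ignored = write(fd, line, len);
    (void)ignored; // nothing sensible to do on failure here
}
```

A handler would call `report_signal(STDERR_FILENO, sig)`. Anything richer, such as demangled symbol names, runs into the malloc() problem described above.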

An alternative approach for handling generic signals is blocking them in a worker thread with sigprocmask() (or pthread_sigmask() in a multithreaded program) and calling sigwait() in that thread. This works for non-crash signals such as SIGINT and SIGTERM. However, "if any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated while they are blocked, the result is undefined" [6], and again, all bets are off.
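For reference, the sigwait() approach for non-crash signals might look like the sketch below. The function names are hypothetical and return values are mostly unchecked; it assumes the signals are blocked before any other thread is created, so that every thread inherits the mask.

```cpp
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <thread>

// Blocks SIGINT and SIGTERM in the calling thread. Threads created
// afterwards inherit this mask, so call it before spawning any thread.
sigset_t block_termination_signals()
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &set, nullptr);
    return set;
}

// Waits synchronously for one of the signals in `set`.
// Returns the signal number, or -1 if sigwait() fails.
int wait_for_signal(const sigset_t& set)
{
    int sig = 0;
    return sigwait(&set, &sig) == 0 ? sig : -1;
}

// Demonstration: a dedicated thread receives a process-directed SIGTERM
// as an ordinary return value instead of through a signal handler.
int demo_sigwait()
{
    const sigset_t set = block_termination_signals();
    int received = 0;
    std::thread waiter([&] { received = wait_for_signal(set); });
    kill(getpid(), SIGTERM); // stays pending until sigwait() accepts it
    waiter.join();
    return received;
}
```

Because the waiting thread runs in normal context, it may call any function it likes after sigwait() returns. That is exactly the property that cannot be carried over to SIGFPE or SIGSEGV, given the undefined behavior quoted above.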

Skimming through the signal-safety man page [4], I found that sem_post() is async-signal-safe (and thread-safe, of course), so I implemented a solution around it that is similar to the sigwait() approach. The idea is to spawn a signal processing thread which blocks signals with pthread_sigmask() and calls sem_wait(). A crash signal handler is also installed such that whenever a crash signal is raised, the handler stores the signal number in a global variable, calls sem_post(), and waits until the signal processing thread finishes processing and exits the program.

Note that the following implementation does not check the return values from syscalls for the sake of simplicity.

// Std
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <thread>

// System
#include <semaphore.h>
#include <signal.h>
#include <unistd.h>

// NOTE: since C++20 the default constructor clears the flag, so
//       `ATOMIC_FLAG_INIT` is no longer needed.
std::atomic_flag caught_signal = ATOMIC_FLAG_INIT;
int crash_sig = 0;

sem_t start_semaphore;
sem_t signal_semaphore;

extern "C" void crash_signal_handler(int sig)
{
    // If two or more threads evaluate this condition at the same time,
    // one of them shall enter the if-branch and the rest will skip it.
    if (caught_signal.test_and_set(std::memory_order_relaxed) == false)
    {
        // `crash_sig` need not be atomic since only this thread and
        // the signal processing thread use it, and the latter is
        // blocked in `sem_wait()` until the `sem_post()` below.
        crash_sig = sig;
        sem_post(&signal_semaphore);
    }

    // It is undefined behavior if a signal handler returns from a crash signal.
    // Implementations may re-raise the signal infinitely, kill the process, or whatnot,
    // but we want the crash signal processing thread to try handling the signal first;
    // so don't return.
    //
    // NOTE: maybe one could use `pselect()` here as it is async-signal-safe and seems to 
    //       be thread-safe as well. `sleep()` is async-signal-safe but not thread-safe.
    while (true)
        ;

    const char msg[] = "Panic: compiler optimized out infinite loop in signal handler\n";

    write(STDERR_FILENO, msg, sizeof(msg));
    std::_Exit(EXIT_FAILURE);
}

void block_crash_signals()
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGSEGV);
    sigaddset(&set, SIGFPE);

    pthread_sigmask(SIG_BLOCK, &set, nullptr);
}

void install_signal_handler()
{
    // NOTE: one may set an alternate stack here.

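    struct sigaction sig{}; // zero-initialize every field
    sig.sa_handler = crash_signal_handler;
    sig.sa_flags   = 0;
    sigemptyset(&sig.sa_mask); // block no additional signals inside the handler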

    ::sigaction(SIGSEGV, &sig, nullptr);
    ::sigaction(SIGFPE,  &sig, nullptr);
}

void restore_signal_handler()
{
    struct sigaction sig{}; // zero-initialize every field
    sig.sa_handler = SIG_DFL;
    sig.sa_flags   = 0;
    sigemptyset(&sig.sa_mask);

    ::sigaction(SIGSEGV, &sig, nullptr);
    ::sigaction(SIGFPE,  &sig, nullptr);
}

void process_crash_signal()
{
    // Block crash signals in this thread so that `crash_signal_handler` is
    // never invoked here; it runs in whichever other thread raised the signal.
    block_crash_signals();

    install_signal_handler();

    // Tell main thread it's good to go.
    sem_post(&start_semaphore);

    // Wait for a crash signal.
    sem_wait(&signal_semaphore);

    // Got a signal.
    //
    // We're not inside a signal handler, so we are "safe" to do anything from
    // this thread, such as writing to `std::cout`. HOWEVER, operations performed
    // by this function, such as writing to `std::cout`, may raise another signal.
    // Or the program may be in a state where the damage is so severe that calling
    // any function will crash it. If that happens, there's not much we can do:
    // this very signal processing function is broken, so let the kernel invoke
    // the default signal action instead.
    restore_signal_handler();

    const char* signame;

    switch (crash_sig)
    {
        case SIGSEGV: signame = "SIGSEGV"; break;
        case SIGFPE:  signame = "SIGFPE"; break;
        default:      signame = "weird, this signal should not be raised";
    }

    std::cout << "Caught signal: " << crash_sig << " (" << signame << ")\n";

    // Uncomment these lines to invoke `SIG_DFL`.
    // volatile int zero = 0;
    // int a = 1 / zero;

    std::cout << "Sleeping for 2 seconds to prove that other threads are waiting for me to finish :)\n";
    std::this_thread::sleep_for(std::chrono::seconds{ 2 });

    std::cout << "Alright, I appreciate your patience <3\n";

    std::exit(EXIT_FAILURE);
}

void divide_by_zero()
{
    volatile int zero = 0;
    int oops = 1 / zero;
}

void access_invalid_memory()
{
    volatile int* p = reinterpret_cast<int*>(0xdeadbeef); // dw, I know what I'm doing lmao
    int oops = *p;
}

int main()
{
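    // Unnamed semaphores must be initialized before first use.
    sem_init(&start_semaphore, 0, 0);
    sem_init(&signal_semaphore, 0, 0);

    // TODO: maybe use the pthread library API instead of `std::thread`.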
    std::thread worker{ process_crash_signal };

    // Wait until `worker` has started.
    sem_wait(&start_semaphore);

    std::srand(static_cast<unsigned>(std::time(nullptr)));

    while (true)
    {
        std::cout << "Odds are the program will crash...\n";

        switch (std::rand() % 3)
        {
            case 0:
                std::cout << "\nCalling divide_by_zero()\n";
                divide_by_zero();
                std::cout << "Panic: divide_by_zero() returned!\n";
                return 1;

            case 1:
                std::cout << "\nCalling access_invalid_memory()\n";
                access_invalid_memory();
                std::cout << "Panic: access_invalid_memory() returned!\n";
                return 1;

            default:
                std::cout << "...not this time, apparently\n\n";
                continue;
        }
    }

    return 0;
}

Compiling it with

$ g++ --version
g++ (Debian 9.2.1-22) 9.2.1 20200104
$ g++ -pthread -o handle_crash_signal handle_crash_signal.cpp

yields

$ ./handle_crash_signal 
Odds are the program will crash...

Calling access_invalid_memory()
Caught signal: 11 (SIGSEGV)
Sleeping for 2 seconds to prove that other threads are waiting for me to finish :)
Alright, I appreciate your patience <3

[1] https://man7.org/linux/man-pages/man7/signal.7.html

[2] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1318.htm

[3] Returning From Catching A Floating Point Exception

[4] https://man7.org/linux/man-pages/man7/signal-safety.7.html

[5] https://chromium.googlesource.com/chromium/src/base/+/master/third_party/symbolize

[6] https://pubs.opengroup.org/onlinepubs/9699919799/functions/sigprocmask.html

Related thread: Catching signals such as SIGSEGV and SIGFPE in multithreaded program

I can neither confirm nor deny if this is POSIX-compliant. I can say that trying to continue after SIGSEGV is not the right move 99.85% of the time. Are you sure you're in that small minority? Would you consider siglongjmp() to be a viable option? Divide-by-zero seems much more recoverable if you tweak the system state and then return from the signal handler. The others, not as much. — metal
The intention of this implementation is to "continue", not as a recover attempt, but in a way such that stack traces may be performed with some degree of signal safety. You can see that std::exit() is called at the end of the signal processing thread. — user10015341
If by tweak the system state you mean performing platform-dependent operations in ucontext_t, I'm not an Assembly programmer :( — user10015341
The usual term, which is not quite accurate but is often good enough, is synchronous signals. — Davis Herring

John Bollinger · Accepted Answer · 2020-08-14T21:14:10

No, it is not POSIX-compliant. Defined signal-handler behavior is especially restricted for multi-threaded programs, as described in the documentation of the signal() function:

If the process is multi-threaded [...] the behavior is undefined if the signal handler refers to any object other than errno with static storage duration other than by assigning a value to an object declared as volatile sig_atomic_t [...].

Your signal handler's proposed access to the semaphore therefore would cause the program's behavior to be undefined, regardless of which function you use. Your handler could conceivably create a local semaphore and manipulate it with async-signal-safe functions, but that would not serve a useful purpose. There is no conforming way for it to access a semaphore (or almost any other object) with wider scope.
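As a rough illustration of how little the quoted passage leaves room for, the sketch below shows a handler whose only interaction with static storage is an assignment to a volatile sig_atomic_t object. The names `last_signal`, `record_signal`, and `install_record_handler` are hypothetical, and SIGUSR1 is used here only because it is harmless to raise; none of this addresses the crash-signal case in the question.

```cpp
#include <signal.h>

// The one kind of static-storage object the quoted passage lets a
// handler touch, and only by assignment.
volatile sig_atomic_t last_signal = 0;

extern "C" void record_signal(int sig)
{
    last_signal = sig; // assignment only: no read, no semaphore, no I/O
}

// Installs `record_signal` for SIGUSR1. Returns true on success.
bool install_record_handler()
{
    struct sigaction action{};
    action.sa_handler = record_signal;
    action.sa_flags   = 0;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGUSR1, &action, nullptr) == 0;
}
```

After `install_record_handler()` succeeds, `raise(SIGUSR1)` runs the handler before returning, and normal code can then inspect `last_signal`. Waking another thread from the handler, as the question's design does with sem_post(), falls outside what this rule allows.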
