8
votes

I'm trying to use robust mutexes on Linux to guard resources shared between processes, and in some situations they do not behave in the "robust" way. By "robust" I mean that pthread_mutex_lock should return EOWNERDEAD if the process owning the lock has terminated.

Here is the scenario where it doesn't work:

There are two processes, p1 and p2. p1 creates the robust mutex and waits on it (after user input). p2 has two threads: thread 1 maps the mutex and acquires it; thread 2 (after thread 1 has acquired the mutex) also maps the same mutex and waits on it, since thread 1 now owns it. Note that p1 starts waiting on the mutex only after p2's thread 1 has already acquired it.

Now if we terminate p2, p1 never unblocks (its pthread_mutex_lock never returns), contrary to the expected robust behaviour, where p1 should unblock with an EOWNERDEAD error.

Here is the code:

p1.cpp:

#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

struct MyMtx {
    pthread_mutex_t m;
};

int main(int argc, char **argv)
{
    int r;

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust_np(&ma, PTHREAD_MUTEX_ROBUST_NP);

    int fd = shm_open("/test_mtx_p", O_RDWR|O_CREAT, 0666);
    ftruncate(fd, sizeof(MyMtx));

    MyMtx *m = (MyMtx *)mmap(NULL, sizeof(MyMtx),
        PROT_READ | PROT_WRITE, MAP_SHARED,fd, 0);
    //close (fd);

    pthread_mutex_init(&m->m, &ma);

    puts("Press Enter to lock mutex");
    fgetc(stdin);

    puts("locking...");
    r = pthread_mutex_lock(&m->m);
    printf("pthread_mutex_lock returned %d\n", r);

    puts("Press Enter to unlock");
    fgetc(stdin);
    r = pthread_mutex_unlock(&m->m);
    printf("pthread_mutex_unlock returned %d\n", r);

    puts("Before pthread_mutex_destroy");
    r = pthread_mutex_destroy(&m->m);
    printf("After pthread_mutex_destroy, r=%d\n", r);

    munmap(m, sizeof(MyMtx));
    shm_unlink("/test_mtx_p");

    return 0;
}

p2.cpp:

#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>

struct MyMtx {
    pthread_mutex_t m;
};

static void *threadFunc(void *arg)
{
    int fd = shm_open("/test_mtx_p", O_RDWR|O_CREAT, 0666);
    ftruncate(fd, sizeof(MyMtx));

    MyMtx *m = (MyMtx *)mmap(NULL, sizeof(MyMtx),
        PROT_READ | PROT_WRITE, MAP_SHARED,fd, 0);
    sleep(2); //to let the first thread lock the mutex
    puts("Locking from another thread");
    int r = 0;
    r = pthread_mutex_lock(&m->m);
    printf("locked from another thread r=%d\n", r);
    return NULL; // a non-void thread function must return a value
}

int main(int argc, char **argv)
{
    int r;
    int fd = shm_open("/test_mtx_p", O_RDWR|O_CREAT, 0666);
    ftruncate(fd, sizeof(MyMtx));

    MyMtx *m = (MyMtx *)mmap(NULL, sizeof(MyMtx),
        PROT_READ | PROT_WRITE, MAP_SHARED,fd, 0);
    //close (fd);

    pthread_t tid;
    pthread_create(&tid, NULL, threadFunc, NULL);

    puts("locking");
    r = pthread_mutex_lock(&m->m);
    printf("pthread_mutex_lock returned %d\n", r);

    puts("Press Enter to terminate");
    fgetc(stdin);

    kill(getpid(), SIGKILL); // signal 9: die while still holding the mutex
    return 0;
}

First, run p1, then run p2 and wait until it prints "Locking from another thread". Press Enter on p1's shell to lock the mutex, then press Enter on p2's shell to terminate p2, or you can just kill it some other way. You will see that p1 prints "locking..." and pthread_mutex_lock never returns.

The problem doesn't happen every time; it seems to depend on timing. If you let some time elapse between p1 starting to lock and terminating p2, it sometimes works and p1's pthread_mutex_lock returns 130 (EOWNERDEAD). But if you terminate p2 right after, or shortly after, p1 starts waiting on the mutex, p1 never unblocks.

Has anybody else ever encountered the same issue?

2
I also changed the code of p2.cpp to avoid mapping the shared memory twice, by making MyMtx *m a global variable and having threadFunc use it instead of calling mmap. I'm getting the same result. - Yevgeniy P
Oddly, when I substitute SIGTERM for your SIGKILL it appears to work as expected. Does the spec say anything about varying behavior based on the signal? - Duck
Hmm, I tried with SIGTERM and it still reproduces. The same with Ctrl-C on p2 (which is SIGINT, I guess). If I terminate p2 right after p1 starts waiting, it actually always reproduces for me. - Yevgeniy P
SIGINT works for me as well. p1 returns with a retcode = 130 (OWNER DIED). I was fiddling with your code a bit, but the only difference I see at the moment is that I commented out the ftruncates in p2. - Duck
Which system are you running on? I'm on Oracle Linux 5. - Yevgeniy P

2 Answers

1
votes

Just verified the behaviour with glibc 2.11.1 on Linux kernel 2.6.32 and newer.

My first finding: if you hit Enter in p1 before p2 prints "Locking from another thread" (that is, within the 2-second sleep), the robust mutex works as one would expect. Conclusion: the order of the waiting threads matters.

The first waiting thread is the one that gets woken up. Unfortunately, in the failing case that is the thread within p2, which is itself being killed at that moment.

See https://lkml.org/lkml/2013/9/27/338 for a description of the problem.

I don't know whether any kernel fixes or patches exist, or even whether this is considered a bug at all.

Nevertheless, there seems to be a workaround for the whole mess: use robust mutexes with PTHREAD_PRIO_INHERIT:

pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);

Inside the kernel (futex.c), the wake-up of the other mutex waiters is then handled by a different mechanism within exit_pi_state_list() instead of handle_futex_death(). That seems to solve the problem.

0
votes

Try to simplify your problem. It seems to be a matter of execution order.
Always consider the worst case: even if you start A and then B, B can still finish while A has only just started running. Add mutex control for that if necessary.
Here is a simple example for A (producer) and B (consumer):

        Main:
            Call A

        A:
            Lock
            Call B
            Produce
            Unlock

        B:
            Lock
            Consume
            Unlock