2
votes

In a Linux application I'm spawning multiple programs via fork/execvp and redirecting the standard I/O streams to pipes for IPC. I spawn a child process, write some data into the child's stdin pipe, close stdin, and then read the child's response from the stdout pipe. This worked fine until I started executing multiple child processes at the same time, using an independent thread per child process.

As soon as I increase the number of threads, I often find that the child processes hang while reading from stdin – although read should immediately return EOF because the stdin pipe has already been closed by the parent process.

I've managed to reproduce this behaviour in the following test program. On my systems (Fedora 23, Ubuntu 14.04; g++ 4.9, 5, 6 and clang 3.7) the program often simply hangs after three or four child processes have exited. Child processes that have not exited are hanging at read(). Killing any child process that has not exited causes all other child processes to magically wake up from read() and the program continues normally.

#include <chrono>
#include <cstdlib> // for quick_exit
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

#include <sys/fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

#define HANDLE_ERR(CODE)     \
    {                        \
        if ((CODE) < 0) {    \
            perror("error"); \
            quick_exit(1);   \
        }                    \
    }

int main()
{
    std::mutex stdout_mtx;
    std::vector<std::thread> threads;
    for (size_t i = 0; i < 8; i++) {
        threads.emplace_back([&stdout_mtx] {
            int pfd[2]; // Create the communication pipe
            HANDLE_ERR(pipe(pfd));

            pid_t pid; // Fork this process
            HANDLE_ERR(pid = fork());
            if (pid == 0) {
                HANDLE_ERR(close(pfd[1])); // Child, close write end of pipe
                for (;;) { // Read data from pfd[0] until EOF or other error
                    char buffer;
                    ssize_t bytes;
                    HANDLE_ERR(bytes = read(pfd[0], &buffer, 1));
                    if (bytes < 1) {
                        break;
                    }

                    // Allow time for thread switching
                    std::this_thread::sleep_for(std::chrono::milliseconds(
                        100));  // This sleep is crucial for the bug to occur
                }
                quick_exit(0); // Exit, do not call C++ destructors
            }
            else {
                { // Some debug info
                    std::lock_guard<std::mutex> lock(stdout_mtx);
                    std::cout << "Created child " << pid << std::endl;
                }

                // Close the read end of the pipe
                HANDLE_ERR(close(pfd[0]));

                // Send some data to the child process
                HANDLE_ERR(write(pfd[1], "abcdef\n", 7));

                // Close the write end of the pipe, wait for the process to exit
                int status;
                HANDLE_ERR(close(pfd[1]));
                HANDLE_ERR(waitpid(pid, &status, 0));

                { // Some debug info
                    std::lock_guard<std::mutex> lock(stdout_mtx);
                    std::cout << "Child " << pid << " exited with status "
                              << status << std::endl;
                }
            }
        });
    }

    // Wait for all threads to complete
    for (auto &thread : threads) {
        thread.join();
    }

    return 0;
}

Compile using

g++ test.cpp -o test -lpthread --std=c++11

Note that I'm perfectly aware that mixing fork and threads is potentially dangerous, but please keep in mind that in the original code I'm immediately calling execvp after forking, and that I don't have any shared state between the child process and the main program, except for the pipes specifically created for IPC. My original code (without the threading part) can be found here.

To me this almost seems like a bug in the Linux kernel, since the program continues correctly as soon as I kill any of the hanging child processes.

Likely some other process you forked off has that pipe open too. The pipe will only close when the last reference to its other end is closed. – David Schwartz
I've checked the pipe fd's and they are unique across all processes/threads. How exactly could a pipe be shared between multiple processes in the above code? – Andreas Stöckel
I think I see it now: if the code is interrupted between pipe() and fork(), multiple processes may possess the same pipe... – Andreas Stöckel

1 Answer

4
votes

This problem is caused by two fundamental principles of how fork and pipes work in Unix: (a) pipe ends are reference counted – the read end only reports EOF once every file descriptor referring to the write end, in any process, has been closed; (b) fork duplicates all open file descriptors of the parent process.
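
To see principle (a) in isolation, here is a tiny standalone sketch (not from the original post; error handling shortened) showing that the reader only gets EOF once the last duplicate of the write end is closed:

#include <cstdio>
#include <unistd.h>

int main()
{
    int pfd[2];
    if (pipe(pfd) < 0) { std::perror("pipe"); return 1; }

    int dup_wr = dup(pfd[1]); // a second descriptor referring to the same write end

    close(pfd[1]);            // the original write end is gone...
    // ...but a read() here would still block: dup_wr keeps the write end alive.

    close(dup_wr);            // only now is the last reference dropped
    char c;
    ssize_t n = read(pfd[0], &c, 1);
    std::printf("read returned %zd (0 == EOF)\n", n); // prints 0
    return 0;
}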

In the above code, the following race condition can happen: if one thread is preempted between its pipe and fork system calls and another thread calls fork during that window, the freshly created pipe file descriptors are duplicated into that other thread's child as well, so the read and write ends are open in more than one process. Remember that all duplicates of the write end must be closed for EOF to be generated – which never happens as long as a stray duplicate is still open in an unrelated child process.
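
To make that interleaving concrete, here is a minimal, deterministic sketch (not part of the original program; error checking omitted, and the sleeps exist purely to force the scheduling described above). Thread B never touches the pipe itself – its child simply inherits the process-wide descriptor table, which is enough to keep A's write end alive:

#include <chrono>
#include <thread>

#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int pfd[2];

    std::thread a([&] {
        pipe(pfd);                    // A creates its pipe ...
        std::this_thread::sleep_for(
            std::chrono::milliseconds(100)); // ... and is "preempted" before forking

        pid_t pid = fork();
        if (pid == 0) {               // A's child: close the write end, read until EOF
            close(pfd[1]);
            char c;
            while (read(pfd[0], &c, 1) > 0) {
            }
            const char msg[] = "A's child finally saw EOF\n";
            write(STDOUT_FILENO, msg, sizeof(msg) - 1);
            _exit(0);
        }
        close(pfd[0]);
        close(pfd[1]);                // A closes its write end, but ...
        waitpid(pid, nullptr, 0);     // ... this only returns once B's child has exited
    });

    std::thread b([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        pid_t pid = fork();           // B forks *between* A's pipe() and fork() calls,
        if (pid == 0) {               // so B's child silently inherits pfd[1] ...
            sleep(2);                 // ... and keeps it open for two seconds
            _exit(0);
        }
        waitpid(pid, nullptr, 0);
    });

    a.join();
    b.join();
    return 0;
}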

The best solution is to use the pipe2 system call with the O_CLOEXEC flag and to immediately call exec in the child process after a controlled duplicate of the file descriptor is created using dup2:

HANDLE_ERR(pipe2(pfd, O_CLOEXEC)); // O_CLOEXEC needs <fcntl.h>; pipe2 is Linux-specific
HANDLE_ERR(pid = fork());
if (pid == 0) {
    HANDLE_ERR(close(pfd[1])); // Child, close write end of pipe
    HANDLE_ERR(dup2(pfd[0], STDIN_FILENO)); // the dup2'ed fd does not carry FD_CLOEXEC
    HANDLE_ERR(execlp("cat", "cat", (char *)NULL)); // argument list must be NULL-terminated
}

Note that the FD_CLOEXEC flag is not copied by the dup2 system call, so the descriptor duplicated onto STDIN_FILENO survives the exec. All other O_CLOEXEC descriptors, including any that leaked into the child from other threads' pipes, are closed automatically as soon as the child reaches the exec system call.
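
For completeness, here is a sketch of how this could look with the bidirectional setup described in the question (HANDLE_ERR as defined above; cat serves as a placeholder for the real execvp target, and the pfd_in/pfd_out names are mine):

int pfd_in[2], pfd_out[2];
HANDLE_ERR(pipe2(pfd_in, O_CLOEXEC));  // parent -> child stdin
HANDLE_ERR(pipe2(pfd_out, O_CLOEXEC)); // child stdout -> parent

pid_t pid;
HANDLE_ERR(pid = fork());
if (pid == 0) {
    // Child: wire the pipe ends onto stdin/stdout. The dup2'ed copies do not
    // carry FD_CLOEXEC and therefore survive the exec; every other O_CLOEXEC
    // descriptor (including any leaked in from other threads' pipes) is
    // closed automatically once execlp succeeds.
    HANDLE_ERR(dup2(pfd_in[0], STDIN_FILENO));
    HANDLE_ERR(dup2(pfd_out[1], STDOUT_FILENO));
    HANDLE_ERR(execlp("cat", "cat", (char *)NULL));
}

// Parent: close the ends that belong to the child, talk over the rest
HANDLE_ERR(close(pfd_in[0]));
HANDLE_ERR(close(pfd_out[1]));
HANDLE_ERR(write(pfd_in[1], "abcdef\n", 7));
HANDLE_ERR(close(pfd_in[1])); // EOF for the child's stdin

char buf[64];
ssize_t n;
while ((n = read(pfd_out[0], buf, sizeof(buf))) > 0) {
    // ... process the child's response ...
}
HANDLE_ERR(n);
HANDLE_ERR(close(pfd_out[0]));

int status;
HANDLE_ERR(waitpid(pid, &status, 0));

Because every descriptor created with O_CLOEXEC vanishes at exec, no locking around the pipe/fork sequence and no per-thread bookkeeping of foreign descriptors is needed.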

From the open(2) man page, on O_CLOEXEC:

O_CLOEXEC (since Linux 2.6.23) Enable the close-on-exec flag for the new file descriptor. Specifying this flag permits a program to avoid additional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.

Note that the use of this flag is essential in some multithreaded programs, because using a separate fcntl(2) F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one thread opens a file descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does a fork(2) plus execve(2). Depending on the order of execution, the race may lead to the file descriptor returned by open() being unintentionally leaked to the program executed by the child process created by fork(2). (This kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec flag should be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this problem.)

The phenomenon of all child processes suddenly exiting when one child process is killed can be explained by comparing this issue to the dining philosophers problem. Just as killing one of the philosophers resolves the deadlock, killing one of the processes closes one of the duplicated file descriptors, triggering an EOF in another child process, which then exits and in turn frees another duplicated file descriptor, and so on.

Thank you to David Schwartz for pointing this out.