6
votes

I have written a simple C program on RedHat Linux which forks a child, runs a program in it with execv, and waits for that child using waitpid.

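#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main( int argc, char * argv[] )
{
    pid_t pid;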
    int status = 0;
    int wait_ret;

    const char * process_path = argv[1];

    if ( argc < 2 )
    {
        exit( EXIT_FAILURE );
    }

    pid = fork(); //spawn child process

    if ( 0 == pid ) //child
    {
        execv( process_path, &argv[1] );

        //execv only returns if it failed
        printf( "execv failed: %s\n", strerror( errno ) );

        exit( EXIT_FAILURE );
    }

    //wait for the child to terminate
    wait_ret = waitpid( pid, &status, WUNTRACED );

    if ( -1 == wait_ret )
    {
        printf( "ERROR: Failed to wait for process termination\n" );
        exit( EXIT_FAILURE );
    }

    // ... handlers for child exit status ...

    return 0;
}

I am using this as a simple watchdog for some processes I am running.

My problem is that one process in particular is not being reaped by waitpid upon exiting and instead remains forever in a Zombie state while waitpid is hung. I am not sure why waitpid is unable to reap this process once it becomes a Zombie (maybe a leaked file descriptor or something).

I could use the WNOHANG flag and poll the child's /proc stat file to check for the Zombie state, but I would prefer a more elegant solution. Maybe there is some function I could use to get the Zombie status without polling this file?
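For reference, a minimal sketch of the polling approach described above, assuming the usual /proc/<pid>/stat layout in which the state character follows the parenthesized command name. The function names stat_line_state and proc_state are made up for illustration and are not part of any library.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Extract the state character (third field) from the text of a
 * /proc/<pid>/stat line. The command name in parentheses may itself
 * contain spaces or ')', so search for the LAST ')' in the line.
 * Returns the state character, or '\0' if the line is malformed. */
char stat_line_state(const char *line)
{
    const char *close = strrchr(line, ')');

    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}

/* Read /proc/<pid>/stat and return the state character
 * ('Z' for a zombie), or '\0' if the file cannot be read. */
char proc_state(pid_t pid)
{
    char path[64];
    char buf[1024];
    char state = '\0';
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return '\0';
    if (fgets(buf, sizeof buf, fp) != NULL)
        state = stat_line_state(buf);
    fclose(fp);
    return state;
}
```

A watchdog would call proc_state(pid) periodically and treat 'Z' as "the child has exited"; this is exactly the polling I would like to avoid.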

Does anyone know an alternative to waitpid which WILL return when the process becomes a Zombie?

Additional Information:

The child process is being closed by a call to exit( EXIT_FAILURE); in one of its threads.

cat /proc/<CHILD_PID>/stat (before exit):

1037 (my_program) S 1035 58 58 0 -1 4194560 1309 0 22 0 445 1749 0 0 20 0 13 0 4399 22347776 1136 4294967295 3336716288 3338455332 3472776112 3472775232 3335760920 0 0 4 31850 4294967295 0 0 17 0 0 0 26 0 0 3338489412 3338507560 3338600448

cat /proc/<CHILD_PID>/stat (after exit):

1037 (my_program) Z 1035 58 58 0 -1 4227340 1316 0 22 0 464 1834 0 0 20 0 2 0 4399 0 0 4294967295 0 0 0 0 0 0 0 4 31850 4294967295 0 0 17 0 0 0 26 0 0 0 0 0

Note that the child PID is 1037 and the parent PID is 1035 in this case.

2
@HaukeLaging Last I checked there was not a C specific Linux stack exchange. Do you want me to ask in Stack Overflow, Software Engineering, or Code Review? This is a Linux specific question not a programming question. - Nathan Owen
What happens if the child exits before the parent has a chance to execute waitpid()? - AlexP
@AlexP From 'man waitpid': "If a child has already changed state, then these calls return immediately. Otherwise, they block until either a child changes state or a signal handler interrupts the call". That said, I am triggering the exit so it has been a long time since waitpid was called. - Nathan Owen
This is a question for Stack Overflow. There is no problem with a question being Linux-specific on SO. Questions which are only relevant for C programmers are off-topic here. We'll see if a majority decides to move the question there. - Hauke Laging
@HaukeLaging I suppose one could argue that all questions here could belong either on Stack Overflow or Super User. Since waitpid is a Linux specific operation, and my question is related only to Linux process behavior, I thought it better to ask here. - Nathan Owen

2 Answers

0
votes

"My problem is that one process in particular is not being reaped by waitpid upon exiting and instead remains forever in a Zombie state while waitpid is hung"

If I understand correctly, you don't want the child to become a zombie. In that case, use the SA_NOCLDWAIT flag. From the manual page of sigaction():

SA_NOCLDWAIT (since Linux 2.6) If signum is SIGCHLD, do not transform children into zombies when they terminate. See also waitpid(2). This flag is meaningful only when establishing a handler for SIGCHLD, or when setting that signal's disposition to SIG_DFL.

              If the SA_NOCLDWAIT flag is set when establishing a  handler
              for SIGCHLD, POSIX.1 leaves it unspecified whether a SIGCHLD
              signal is generated when a  child  process  terminates.   On
              Linux,  a  SIGCHLD signal is generated in this case; on some
              other implementations, it is not.

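The idea: when the child terminates while the parent is still running, the parent receives SIGCHLD (signal number 17 on x86 and ARM Linux) and the child becomes a zombie until it is waited for. To have the child removed as soon as it terminates instead, set the SA_NOCLDWAIT flag. Note that waitpid() can then no longer collect the child's exit status; it fails with ECHILD.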

Here is the sample code

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void my_isr(int n) {
        /* error handling */
}
int main(void) {
        if(fork()==0) { /* child process */
                printf("In child process ..c_pid: %d and p_pid : %d\n",getpid(),getppid());
                sleep(5);
                printf("sleep over .. now exiting \n");
        }
        else { /*parent process */
                struct sigaction v;
                v.sa_handler=my_isr;/* SET THE HANDLER TO ISR */
                v.sa_flags=SA_NOCLDWAIT; /* it will not let child to become zombie */
                sigemptyset(&v.sa_mask);
                sigaction(SIGCHLD,&v,NULL);/* when parent receives SIGCHLD, IT GETS CALLED */
                while(1) pause(); /*for observation purpose, keeps parent alive without spinning the CPU */
        }
        return 0;
}

Comment out or restore the v.sa_flags=SA_NOCLDWAIT; line and compare the behavior: run a.out in one terminal and check ps -el | grep pts/0 in another terminal.

"Does anyone know an alternative to waitpid which WILL return when the process becomes a Zombie?"

Use WNOHANG, as you suggested. It is described in the manual page of waitpid(), next to the WUNTRACED flag your code currently passes:

WNOHANG return immediately if no child has exited.

WUNTRACED also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
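A rough sketch of such a WNOHANG loop, sleeping between checks so it does not spin. The name poll_child and its interval_ms and max_polls parameters are invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Poll for the termination of `pid` without blocking indefinitely.
 * Checks every `interval_ms` milliseconds, at most `max_polls` times.
 * Returns 1 and fills *status once the child has been reaped,
 * 0 if it was still running after the last poll, -1 on waitpid error. */
int poll_child(pid_t pid, int *status, long interval_ms, int max_polls)
{
    struct timespec delay;
    int i;

    delay.tv_sec = interval_ms / 1000;
    delay.tv_nsec = (interval_ms % 1000) * 1000000L;

    for (i = 0; i < max_polls; i++) {
        pid_t r = waitpid(pid, status, WNOHANG);

        if (r == pid)
            return 1;   /* child terminated and was reaped */
        if (r == -1)
            return -1;  /* e.g. ECHILD: not a child of this process */
        /* r == 0: the child exists but has not terminated yet */
        nanosleep(&delay, NULL);
    }
    return 0;
}
```

A return value of 0 lets the caller decide what to do with a child that has not terminated in time, for example logging it or sending a signal.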

0
votes

Any process that terminates becomes a zombie until it is collected by a wait call. Here the wait does not seem to happen in all cases.

From the code given I can't figure out why the wait does not happen and the process remains a zombie. (not without running it anyway)

But instead of waiting on a specific pid only, you can wait on any child by using -1 as the first argument to waitpid. Don't use WNOHANG, as it requires busy polling (don't do that).
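Something along these lines, as an untested-in-your-setup sketch (wait_any_child is just a name picked for the example). It also retries when a signal handler interrupts the call.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Block until any child of the calling process terminates.
 * Returns the PID of the reaped child and stores its raw status in
 * *status; returns -1 with errno set on failure (ECHILD when there
 * are no children left). Retries if a signal interrupts the call. */
pid_t wait_any_child(int *status)
{
    pid_t r;

    do {
        r = waitpid(-1, status, 0);  /* -1: any child; 0: no option flags */
    } while (r == -1 && errno == EINTR);

    return r;
}
```

Compare the returned PID with the one from fork() to see which child was collected.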

You may also want to drop WUNTRACED unless you have a specific reason to include it. There is no harm in dropping it to see what difference it makes.