2
votes

I have a Perl script (snippet below) that runs in cron to perform system checks. I fork a child as a timeout and reap it with a $SIG{CHLD} handler. Perl makes several system calls to Bash scripts and checks their exit status. One Bash script fails about 5% of the time with no error: the Bash script exits with 0, but Perl sees $? as -1 and $! as "No child processes".

This Bash script tests compiler licenses, and Intel icc is left around after the Bash script completes (ps output below). I think the icc zombie completes, forcing Perl into the $SIG{CHLD} handler, which blows away the $? status before I'm able to read it.

Compile status -1; No child processes

#!/usr/bin/perl
use strict;
use POSIX ':sys_wait_h';

my $GLOBAL_TIMEOUT = 1200;

### Timer to notify if this program hangs
my $timer_pid;
$SIG{CHLD} = sub {
    local ($!, $?);
    while((my $pid = waitpid(-1, WNOHANG)) > 0)
    {
        if($pid == $timer_pid)
        {
            die "Timeout\n";
        }
    }
};

die "Unable to fork\n" unless(defined($timer_pid = fork));
if($timer_pid == 0)  # child
{
    sleep($GLOBAL_TIMEOUT);
    exit;
}
### End Timer

### Compile test
my @compile = `./compile_test.sh 2>&1`;
my $status = $?;
print "Compile status $status; $!\n";
if($status != 0)
{
    print "@compile\n";
}

END  # Timer cleanup
{
    if($timer_pid != 0)
    {
        $SIG{CHLD} = 'IGNORE';
        kill(15, $timer_pid);
    }
}

exit(0);

#!/bin/sh

cc compile_test.c
if [ $? -ne 0 ]; then
    echo "Cray compiler failure"
    exit 1
fi

module swap PrgEnv-cray PrgEnv-intel
cc compile_test.c
if [ $? -ne 0 ]; then
    echo "Intel compiler failure"
    exit 1
fi

wait
ps
exit 0

The wait doesn't really wait, because cc calls icc, which leaves behind a zombie grandchild process that wait (or wait PID) doesn't block on. (Running wait `pidof icc`, 31589 in this case, gives "not a child of this shell".)

user 31589     1  0 12:47 pts/15   00:00:00 icc

I just don't know how to fix this in Bash or Perl.

Thanks, Chris

3
It looks like you are going to a lot of trouble to avoid using alarm. Is there a reason not to use alarm here? - mob
Your SIGCHLD handler is also reaping the shell spawned by the backticks, so the waitpid call done by the backticks fails (since the child has already been reaped). - ikegami
I have several bash calls in the real Perl script. Only this one fails periodically. Just noticed today the icc left behind, that "wait" can't catch. - Chris
"this one fails" -- I didn't get what fails? The fact that icc stays around (which is awkward), or is there an actual error? Note that "Compile status -1; No child processes" isn't an error since you have a CHLD handler and check $? after backticks, which may have gotten reaped by handler (so the only error is doing both). Also, from what you show it appears that cc starts icc and doesn't wait for it ...? (Are you sure? That sounds really strange to me.) - zdim
Note, you can't really check wait 31589 (or such) since you don't know what PID of a child is in the current run (it is most likely different from what it was in previous runs). - zdim
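
ikegami's comment suggests one possible fix: have the handler reap only the timer child, so the shell spawned by backticks is left for Perl's own waitpid. The following is a minimal sketch of that idea, not the asker's final code; start_timer and stop_timer are hypothetical helper names introduced here for illustration.

```perl
use strict;
use warnings;
use POSIX ':sys_wait_h';

my $timer_pid;

# Hypothetical helper: fork a sleeping child that acts as a watchdog.
sub start_timer {
    my ($seconds) = @_;
    $SIG{CHLD} = sub {
        local ($!, $?);
        return unless $timer_pid;    # also true in the forked child
        # Reap ONLY the timer; every other child is left for system()/backticks.
        die "Timeout\n" if waitpid($timer_pid, WNOHANG) == $timer_pid;
    };
    defined($timer_pid = fork) or die "Unable to fork\n";
    if ($timer_pid == 0) {           # child
        sleep($seconds);
        POSIX::_exit(0);             # skip END blocks in the child
    }
    return $timer_pid;
}

# Hypothetical helper: cancel the watchdog without triggering the handler.
sub stop_timer {
    return unless $timer_pid;
    local $SIG{CHLD} = 'DEFAULT';
    kill('TERM', $timer_pid);
    waitpid($timer_pid, 0);
    $timer_pid = 0;
}

# Usage, assuming the script from the question:
#   start_timer(1200);
#   my @compile = `./compile_test.sh 2>&1`;
#   my $status  = $?;
#   stop_timer();
```

Because the handler never calls waitpid(-1, ...), a SIGCHLD caused by the backtick shell exiting is a no-op, and $? after the backticks should carry the script's real exit status.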

3 Answers

1
votes

Isn't this a use case for alarm? Toss out your SIGCHLD handler and say

local $? = -1;
eval {
    local $SIG{ALRM} = sub { die "Timeout\n" };
    alarm($GLOBAL_TIMEOUT);
    @compile = `./compile_test.sh 2>&1`;
    alarm(0);
};

my $status = $?;

instead.
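
For completeness, here is a slightly fuller sketch of the alarm approach that also inspects the eval result and stops the command on timeout. It is an illustration under assumptions, not part of the original answer: run_with_alarm is a hypothetical helper, and it uses a list-form piped open instead of backticks so that the child PID is available to kill.

```perl
use strict;
use warnings;

# Hypothetical helper: run @cmd, capture its stdout, give up after $timeout seconds.
# Returns ($status, \@output); $status is undef if the command timed out.
sub run_with_alarm {
    my ($timeout, @cmd) = @_;
    my ($pid, $fh, $status, @output);

    my $ok = eval {
        local $SIG{ALRM} = sub { die "Timeout\n" };
        alarm($timeout);
        $pid = open($fh, '-|', @cmd) or die "Cannot run @cmd: $!\n";
        @output = <$fh>;
        close($fh);              # waits for the child and sets $?
        $status = $?;
        alarm(0);
        1;
    };
    return ($status, \@output) if $ok;

    my $err = $@;
    alarm(0);
    if ($pid) {
        kill('TERM', $pid);      # stop the command we started
        close($fh);              # reap it
    }
    die $err unless $err eq "Timeout\n";
    return (undef, \@output);
}

# Usage, assuming the script from the question (sh supplies the 2>&1 redirect):
#   my ($status, $compile) = run_with_alarm(1200, 'sh', '-c', './compile_test.sh 2>&1');
```

Note that only the direct child is signalled on timeout; any processes it has spawned in turn are not handled by this sketch.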

1
votes

I thought the quickest solution would be to add sleep of a second or two at the bottom of the bash script to wait for the zombie icc to complete. But that didn't work.

If I didn't already have a $SIG{ALRM} handler (in the real program), I agree the best choice would be to wrap the whole thing in an eval, even though that would be pretty ugly for a 500-line program.

Without the local($?), every `system` call gets $? = -1. The $? I need in this case is after waitpid, then unfortunately set to -1 after the sig handler exits. So I find this works. New lines shown with ###

my $timer_pid;
my $chld_status;    ###
$SIG{CHLD} = sub {
    local($!, $?);
    while((my $pid = waitpid(-1, WNOHANG)) > 0)
    {
        $chld_status = $?;    ###
        if($pid == $timer_pid)
        {
            die "Timeout\n";
        }
    }
};

...
my @compile = `./compile_test.sh 2>&1`;
my $status = ($? == -1) ? $chld_status : $?;    ###
...

1
votes

We had a similar issue; here is our solution: leak the write side of a pipe into the grandchild and read() from the read side in the parent, which blocks until the grandchild exits.

See also: wait for children and grand-children

use Fcntl;

# OCF scripts invoked by Pacemaker will be killed by Pacemaker with
# a SIGKILL if the script exceeds the configured resource timeout. In
# addition to killing the script, Pacemaker also kills all of the children
# invoked by that script. Because it is a kill, the scripts cannot trap
# the signal and clean up; because all of the children are killed as well,
# we cannot simply fork and have the parent wait on the child. In order
# to work around that, we need the child not to have a parent process
# of the OCF script---and the only way to do that is to grandchild the
# process. However, we still want the parent to wait for the grandchild
# process to exit so that the OCF script exits when the grandchild is
# done and not before. This is done by leaking the write file descriptor
# from pipe() into the grandchild and then the parent reads the read file
# descriptor, thus blocking until it gets IO or the grandchild exits. Since
# the file descriptor is never written to by the grandchild, the parent
# blocks until the grandchild exits.
sub grandchild_wait_exit
{
    # We use "our" instead of "my" for the write side of the pipe. If
    # we did not, then when the sub exits and $w goes out of scope,
    # the file descriptor will close and the parent will exit.
    pipe(my $r, our $w);

    # Enable leaking the file descriptor into the children
    my $flags = fcntl($w, F_GETFD, 0) or warn $!;
    fcntl($w, F_SETFD, $flags & (~FD_CLOEXEC)) or die "Can't set flags: $!\n";

    # Fork the child
    my $child = fork();
    if ($child) {
        # We are the parent, waitpid for the child and
        # then read to wait for the grandchild.
        close($w);
        waitpid($child, 0);
        <$r>;
        exit;
    }

    # Otherwise we are the child, so close the read side of the pipe.
    close($r);

    # Fork a grandchild, exit the child.
    if (fork()) {
        exit;
    }

    # Turn off leaking of the file descriptor in the grandchild so
    # that no other process can write to the open file descriptor
    # that would prematurely exit the parent.
    $flags = fcntl($w, F_GETFD, 0) or warn $!;
    fcntl($w, F_SETFD, $flags | FD_CLOEXEC) or die "Can't set flags: $!\n";
}

grandchild_wait_exit();

sleep 1;
print getppid() . "\n";
print "$$: gc\n";
sleep 30;
exit;
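
Applied to the question, the same pipe trick might look like the sketch below: the parent runs a command, collects its exit status, and then blocks until every descendant that inherited the pipe's write end has exited. This is an illustration under assumptions, not code from the answer; run_and_wait_for_descendants is a hypothetical name, and it only works if the descendants (for example icc) do not close inherited descriptors themselves.

```perl
use strict;
use warnings;
use POSIX ();
use Fcntl qw(F_GETFD F_SETFD FD_CLOEXEC);

# Hypothetical helper: run @cmd, then wait for all descendants holding the pipe.
# Returns the command's wait status (same encoding as $?).
sub run_and_wait_for_descendants {
    my (@cmd) = @_;
    pipe(my $r, my $w) or die "pipe failed: $!\n";

    # Clear close-on-exec so the write end survives exec() in the child.
    my $flags = fcntl($w, F_GETFD, 0);
    defined $flags or die "fcntl F_GETFD failed: $!\n";
    fcntl($w, F_SETFD, $flags & ~FD_CLOEXEC) or die "fcntl F_SETFD failed: $!\n";

    my $pid = fork();
    defined $pid or die "fork failed: $!\n";
    if ($pid == 0) {                 # child: keep $w open, run the command
        close($r);
        exec { $cmd[0] } @cmd or do {
            warn "exec failed: $!\n";
            POSIX::_exit(127);
        };
    }

    close($w);                       # parent must not hold the write end
    waitpid($pid, 0);
    my $status = $?;

    # EOF arrives only once every process holding the write end is gone.
    while (1) {
        my $n = sysread($r, my $buf, 4096);
        next if !defined $n && $!{EINTR};
        last if !defined $n || $n == 0;
    }
    close($r);
    return $status;
}

# Usage, assuming the script from the question:
#   my $status = run_and_wait_for_descendants('./compile_test.sh');
```

Unlike backticks, this sketch does not capture the command's output; it only demonstrates the waiting behaviour.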