There have two Linux nodes(hostA and hostB) in my test environments, I need to trigger a script(worker.sh) to run on all nodes concurrently, the worker.sh was already placed in all nodes, so I use threads module in my Perl scripts(master.pl), here is the code snippet:
use threads(stringify);
sub runByThreads{
my($count,$funcion,$host_ref,$cmd) = @_;
@hostlist = @{$host_ref};
my $thread;
my @failNodes;
for (my $i=0;$i<$count;$i++) {
my $host =@hostlist[$i];
$thread = threads->create($funcion,$host,$cmd);
$parserState{$thread} = $host;
$thread_num ++;
}
while ($thread_num != 0) { # stuck in this while loop
foreach my $subthread(threads->list(threads::joinable)) {
my $ret = $subthread->join();
if ($ret != 0) {
....
}
$thread_num --;
}
sleep 2;
}
}
sub runCmd {
my ($host,$cmd) = @_;
chomp($localhost = `hostname -f`);
if ($localhost eq $host) {
$ret = system("source /etc/profile; $cmd");
} else {
$ret = system("ssh -o StrictHostKeyChecking=no ".$host." \"source /etc/profile; ". $cmd."\"");
}
return $ret;
}
main {
my @servers = qw/hostA hostB/
my $nodecount = scalar(@servers);
my $arg = "--node";
$cmd = "$HOME/worker.sh "."$arg";
my @ret = &runByThreads($nodecount,\&runCmd,\@servers,$cmd);
if ( scalar(@ret) != 0) {
$failNum += 1;
}
}
&main;
This perl script is run on hostA, in normal case, the ps command shows:
0 S optitest 9338 9337 0 80 0 - 50630 pipe_w 06:57 ? 00:00:00 /usr/bin/perl master.pl
0 S optitest 9992 9338 0 80 0 - 26536 wait 06:57 ? 00:00:00 sh -c source /etc/profile; /home/jack/linux/worker.sh --node
0 S optitest 10023 9338 0 80 0 - 14151 poll_s 06:57 ? 00:00:00 ssh -o StrictHostKeyChecking=no hostB source /etc/profile; /home/jack/linux/worker.sh --node
0 S optitest 10757 10741 0 80 0 - 1608 pipe_w 06:59 ? 00:00:00 grep 9338
but sometimes, the ps shows there has a defunct process exists, and the defunct process will cause master.pl stuck in the while loop,
0 S optitest 6503 6502 1 80 0 - 50628 pipe_w 05:51 ? 00:00:00 /usr/bin/perl master.pl
0 Z optitest 7496 6503 0 80 0 - 0 exit 05:51 ? 00:00:00 [hostname] <defunct>
0 S optitest 7497 6503 0 80 0 - 26536 wait 05:51 ? 00:00:00 sh -c source /etc/profile; cd /home/jack/linux/worker.sh --node
I know zombie process is a process that has completed execution (via the exit system call) but still has an entry in the process table , This occurs for child processes, where the entry is still needed to allow the parent process to read its child's exit status: once the exit status is read via the wait system call, the zombie's entry is removed from the process table and it is said to be "reaped"
I am confused how the defunct process was generated in my tests, the defunct process should be the one run work.pl on hostB via ssh, but I found it seems the process become defunct process immediately when it created by the Perl system call, because I didn't see any output for its run, even the 'echo' in the first line of worker.sh was't executed.
One thing is also strange, in worker.sh, some scripts are called to run in background, If I empty the worker.sh on hostB, the defunct issue can also be happened, but if I empty the worker.sh on both hostA and hostB, I never see the defunct issue again.
Sorry for the long post, I am trying my best to make my question more clear, could you please help me to check what was going wrong, did I miss anything when use the threads module, or there have some issues of the threads module, because I noticed that the use of interpreter-based threads in perl is officially discouraged. http://perldoc.perl.org/threads.html