waitpid returns ECHILD - but pid was valid

Question

I have a program that spawns other processes with execve:

  s32 ret = execve( argv[0], argv.data(), (char* const*) req.posixEnv() );

Then later in a loop I call waitpid to watch for when the process terminates:

while( 1 )
{
  readOutputFromChildProcess( pid );

  int status;
  s32 retPid = waitpid( pid, &status, WNOHANG );

  if ( retPid < 0 )
  {
     if ( errno == ECHILD )
     {
         // I don't expect to ever get this error - but I do. why?
         printf( "Process gone before previous wait. Return status lost.\n" );
         assert(0); 
     } else {
         // other real errors handled here.
         handleError();
         break;
     }
  }

  if ( retPid == 0 )
  {
     waitSomeTime();
     continue; 
  }

  processValidResults( status );
  break;
}

I have greatly simplified the code. My understanding is that once you spawn a process, the process table entry remains until the caller calls "waitpid" and gets a return value greater than zero, and a valid return status.

But what seems to happen in some cases is that the process terminates on its own, and when I call waitpid, it returns -1 with errno set to ECHILD.

ECHILD means that at the time I called waitpid there was no process in the process table with that id. So either my pid was invalid (I've checked carefully, and it is valid), or waitpid had already been called after this process finished, in which case I am unable to get the return code from this process.

The program is multithreaded. I've also checked that I'm not calling waitpid too early: the error happens after several "waits".

Is there any other way a process table entry gets cleaned up without calling waitpid? How can I make sure that I always get the return code?

@Explicitly ignoring SIGCHLD:

Ok, so I understand that explicitly ignoring it will cause waitpid() to fail. I don't explicitly ignore it, but I do set some signal handlers to trap crashes in another place like so:

void kxHandleCrashes()
{
   struct sigaction sa;
   sa.sa_flags = SA_SIGINFO;
   sa.sa_sigaction = abortHandler;
   sigemptyset( &sa.sa_mask );

   sigaction( SIGABRT, &sa, NULL );
   sigaction( SIGSEGV, &sa, NULL );
   sigaction( SIGBUS,  &sa, NULL );
   sigaction( SIGILL,  &sa, NULL );
   sigaction( SIGFPE,  &sa, NULL );
   sigaction( SIGPIPE, &sa, NULL );

   // Should I add a line like this:
   // sigaction( SIGCHLD, &sa, NULL );
}

"It happens after several waits": are you trying to wait more than once on the same process? — cnicutar
Yes. That is why I call it with WNOHANG. I need to be able to return to my thread periodically to report progress, and also to terminate the called process if it hangs. — Rafael Baptista
I think you can only wait successfully once on a child. After that the kernel cleans up the process information and leaves no trace. — cnicutar
Only if the return is positive. If you call with WNOHANG and get a zero return, the process should not be cleaned up. — Rafael Baptista
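The point made in these comments can be checked with a small sketch. The struct `wait_trace` and function `trace_waits` are hypothetical names, and the pipe is only there to keep the child alive until the parent has polled once. A WNOHANG poll that returns 0 leaves the child waitable, the first positive return reaps it, and any wait after that fails with ECHILD.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Outcome of three successive waitpid() calls on one child. */
struct wait_trace {
    pid_t while_running;  /* WNOHANG poll while the child is alive */
    pid_t after_exit;     /* blocking wait once the child may exit */
    int   exit_code;      /* exit code collected by that wait */
    pid_t second_reap;    /* another wait after the child was reaped */
    int   second_errno;   /* errno left by that last call */
};

/* Returns 0 on success, -1 if pipe() or fork() failed. */
int trace_waits(struct wait_trace *t)
{
    int gate[2];
    if (pipe(gate) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(gate[1]);
        /* Blocks until the parent closes its write end (EOF). */
        if (read(gate[0], &c, 1) < 0)
            _exit(1);
        _exit(3);
    }
    close(gate[0]);

    int status = 0;
    t->while_running = waitpid(pid, &status, WNOHANG);  /* child still alive */
    close(gate[1]);                                     /* let the child exit */
    t->after_exit = waitpid(pid, &status, 0);           /* reaps the child */
    t->exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;

    errno = 0;
    t->second_reap = waitpid(pid, &status, 0);          /* nothing left to reap */
    t->second_errno = errno;
    return 0;
}
```

In a multithreaded program the "second" wait does not have to be in the same loop: any thread that calls wait() or waitpid(-1, ...) can reap the child first, and the polling loop then sees ECHILD.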

Nazar Nazar · Accepted Answer · 2014-06-26T00:17:42

I had a similar problem: waitpid would just fail with ECHILD. The child process was running, I did not touch the SIGCHLD handler (the default handler was in place), and yet I was still getting ECHILD from waitpid every time.

After a few hours of investigation it turned out that I forked the children and then daemonized the parent (which itself forks), which effectively turned all the children into orphans.

I moved parent daemonization to occur before forking children and everything started to work flawlessly.
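A rough sketch of what went wrong, with made-up names (`sibling_wait_errno`, `worker`, `daemon_pid`). A real daemonize step has the original process exit, so the worker is re-parented to init. Here the original stays alive only so it can report the result. Either way, the process that continues after the second fork is a sibling of the worker, not its parent, so its waitpid fails with ECHILD.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns ECHILD if the "daemon" process got ECHILD when waiting on the
 * worker, 0 if it did not, or -1 if a fork() or wait failed. */
int sibling_wait_errno(void)
{
    /* Step 1: fork the worker first (the wrong order). */
    pid_t worker = fork();
    if (worker < 0)
        return -1;
    if (worker == 0) {
        sleep(1);                 /* still running while the sibling waits */
        _exit(0);
    }

    /* Step 2: "daemonize" by forking again; the new process carries on. */
    pid_t daemon_pid = fork();
    if (daemon_pid == 0) {
        int status;
        /* worker was forked by our parent, so it is not our child. */
        pid_t ret = waitpid(worker, &status, 0);
        _exit((ret < 0 && errno == ECHILD) ? 42 : 0);
    }

    /* Original process: collect the daemon's verdict, then reap the worker. */
    int status = 0;
    int result = -1;
    if (daemon_pid > 0 && waitpid(daemon_pid, &status, 0) == daemon_pid)
        result = (WIFEXITED(status) && WEXITSTATUS(status) == 42) ? ECHILD : 0;
    waitpid(worker, &status, 0);
    return result;
}
```

Forking the worker after the daemonize step makes it a child of the process that actually calls waitpid, which is the fix described above.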

So if you get this mysterious ECHILD error and you did not touch the SIGCHLD signal handler, check that those children are actually still your children, that is, that each child's PPID equals the parent's PID.
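One way to run that check from code, as a sketch only: have the child report getppid() over a pipe and compare it with the parent's getpid(). The function name `child_sees_us_as_parent` is hypothetical. In a real program you would run the comparison in the process that later calls waitpid, not in the one that forked.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a freshly forked child reports this process as its parent,
 * 0 if it reports some other PID, or -1 on error. */
int child_sees_us_as_parent(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t me = getpid();
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send back the PPID it actually sees. */
        pid_t parent = getppid();
        close(fds[0]);
        ssize_t n = write(fds[1], &parent, sizeof parent);
        _exit(n == (ssize_t)sizeof parent ? 0 : 1);
    }

    close(fds[1]);
    pid_t reported = -1;
    ssize_t n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);

    int status;
    waitpid(pid, &status, 0);     /* reap the child so no zombie is left */

    if (n != (ssize_t)sizeof reported)
        return -1;
    return reported == me;
}
```

If the reported PPID differs from the PID of the process calling waitpid (for example, it is 1 after the original parent exited), waitpid on that child will fail with ECHILD.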
