
The situation is as follows:

We have a main application and a watcher application. Both are C++ applications, and both call daemon(1,0).

Watcher checks whether the main application is running. If it finds that the main process is absent (crashed) or does not respond (the applications 'talk' to each other over TCP, and that's how Watcher knows Main has hung), it starts or restarts Main.

The TCP settings for the connection can be changed by the user, which is done through the main app. After the change, Watcher must be restarted to load the new configuration. That restart is also done from the main app.

As it is, it works fine.
1. On startup Main app DOES kill existing watcher process and runs it again. [This is correct]
2. Watcher app DOES kill main and runs it again. [This is correct]

BUT

  1. If I run Main, which in turn starts Watcher,
  2. then kill Main so that Watcher is left alone,
  3. Watcher sees that there is no Main anymore and starts it again.
  4. Main starts again, kills Watcher and tries to start it again...
  5. and at this point something nonsensical happens. It starts Watcher (I can see the TCP port being taken in the netstat output), but there is no process named Watcher.

Normally netstat shows tcp 0 0 IP:TCP_PORT LISTEN Watcher; now it shows tcp 0 0 IP:TCP_PORT LISTEN Main.

It is as if watcher is there, but inside the Main process.

I use scripts to run applications. Watcher uses this

#!/bin/sh
killall -9 Main
./Main

And runs it like system("./runMain.sh&");

Main uses this

#!/bin/sh
killall -9 Watcher
./Watcher

And runs it like system("./runWatcher.sh&");

What am I doing wrong? How do I run them so that they can restart each other when needed and always start in separate processes?

So far I have also tried running the scripts with nohup; the result is the same.

EDIT 1:

Note: numbers here are just for clarity. In reality PID is not 1 of course.

  1. I run Main. netstat shows me:

    tcp 0 0 192.168.0.1:7000 LISTEN (PID 1)Main
    tcp 0 0 192.168.0.1:7001 LISTEN (PID 1)Main

  2. Main starts the Watcher using the script. Now netstat shows me:

    tcp 0 0 192.168.0.1:7000 LISTEN (PID 1)Main
    tcp 0 0 192.168.0.1:7001 LISTEN (PID 1)Main
    tcp 0 0 192.168.0.1:8000 LISTEN (PID 2)Watcher

  3. Now I manually kill Main by doing killall -9 Main. Now netstat shows me:

    tcp 0 0 192.168.0.1:7000 LISTEN (PID 2)Watcher
    tcp 0 0 192.168.0.1:7001 LISTEN (PID 2)Watcher
    tcp 0 0 192.168.0.1:8000 LISTEN (PID 2)Watcher

    Notice the change in who owns the listening sockets now? How did that happen?

  4. Watcher sees that Main is gone and so it starts it using the script file.

  5. Main kills the Watcher on startup. Netstat shows:

    tcp 0 0 192.168.0.1:7000 LISTEN (PID 3)Main
    tcp 0 0 192.168.0.1:7001 LISTEN (PID 3)Main
    tcp 0 0 192.168.0.1:8000 LISTEN (PID 3)Main

And that's it. Watcher never runs again. I tried to debug it in Eclipse: Watcher crashes without throwing anything, right on the line daemon(1,0).

How do you know Watcher isn't starting but then terminating due to an error? Instrumenting your daemons with some logging to a file might be helpful. - mah
Well, because Watcher is the only app using the TCP_PORT. Let's say it's 8000. Main never listens on 8000, only Watcher does. In the last case, however, it shows that Main listens on 8000. - user1651105
You stated "It is as if watcher is there, but inside the Main process", but surely you know already that this isn't likely (given how you're starting things: system() starts a shell, which then starts a new shell for your script, which then does whatever it does). You need to challenge the other things you're certain of and, instead of assuming all is as it should be (since clearly it isn't), find positive proof of those things, such as Watcher starting and not then stopping. - mah
See my edit 1. It's quite hard to debug, as it does not throw exceptions or anything else. It just terminates the app. I could place a breakpoint right on the daemon() line; one more step and it crashes. - user1651105

1 Answer


How about using a custom signal (or even listening on another port for admin commands)? Using kill -9 interferes with the process tree: the child process can end up holding the parent's resources (ports, etc.).
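
To make the resource point concrete: system() forks and execs a shell, and every descriptor the caller has open, including listening sockets, is inherited across that fork and exec unless it is marked close-on-exec. That would be consistent with the netstat output in EDIT 1, where Watcher ends up owning ports 7000 and 7001 after Main dies, and the restarted Main ends up owning 8000 (which would then prevent a new Watcher from binding it). The following is only a sketch of one possible fix, not the asker's code; set_cloexec and make_socket_cloexec are hypothetical helper names.

```cpp
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Mark an already-open descriptor close-on-exec, so that programs started
// through fork() + exec() (which is what system() does) do not inherit it.
// Returns true on success.
bool set_cloexec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1) return false;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}

// Hypothetical helper: create a TCP socket that will not leak into children.
// The caller would still bind() and listen() on it as before.
int make_socket_cloexec() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) return -1;
    if (!set_cloexec(fd)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

If both applications created their listening sockets this way (an assumption about code not shown in the question), a killed process would release its ports instead of leaving them with whatever it had launched. Separately, daemon() itself forks and lets the original process exit, so a debugger attached to that original process will see it terminate on that line; that alone may not indicate a crash.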

Then, on top of that, when the Main process is started by Watcher, why does it assume that the running instance of Watcher should be killed? At that point Watcher is the parent of Main, so I can see how that could cause trouble.

It comes down to the need for the two processes to communicate outside of the 'kill' signal.
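
As a minimal sketch of the custom-signal idea: instead of being killed and restarted, Watcher could install a handler for a signal such as SIGUSR1 and reload its TCP settings when the signal arrives. The choice of signal and the names g_reload_requested, install_reload_handler and consume_reload_request are assumptions made for illustration.

```cpp
#include <signal.h>

// Set by the handler and polled by the main loop. sig_atomic_t is the type
// that is safe to write from inside a signal handler.
volatile sig_atomic_t g_reload_requested = 0;

extern "C" void on_reload_signal(int) {
    g_reload_requested = 1;  // only set a flag; do the real work outside the handler
}

// Install the handler for the given signal (for example SIGUSR1).
// Returns true on success.
bool install_reload_handler(int signo) {
    struct sigaction sa = {};
    sa.sa_handler = on_reload_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // restart interrupted system calls such as accept()
    return sigaction(signo, &sa, nullptr) == 0;
}

// Returns true once per pending request and clears the flag.
bool consume_reload_request() {
    if (!g_reload_requested) return false;
    g_reload_requested = 0;
    return true;
}
```

Watcher's main loop would call consume_reload_request() on each iteration and re-read its configuration when it returns true. Main would then send the signal with kill(watcher_pid, SIGUSR1) rather than killall -9, which requires Main to know Watcher's PID (for instance from a PID file).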

Use a semaphore or some other OS-level communication mechanism to coordinate between the two.
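
One OS-level mechanism that could serve here, swapped in for a semaphore, is an advisory file lock: each application holds a lock on its own lock file while it runs, so the other can tell whether an instance already exists without resorting to killall. This is a sketch under that assumption; acquire_instance_lock and the lock path are hypothetical.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Try to become the single running instance identified by lock_path.
// Returns a descriptor that holds the lock, or -1 if another descriptor
// already holds it or an error occurred. The lock is released when the
// descriptor is closed, including when the process is killed.
int acquire_instance_lock(const char* lock_path) {
    // O_CLOEXEC keeps the lock from leaking into programs started via system().
    int fd = open(lock_path, O_RDWR | O_CREAT | O_CLOEXEC, 0644);
    if (fd == -1) return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Watcher would call this once at startup and exit if it returns -1; Main could then start Watcher unconditionally and let a duplicate instance refuse to run.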