The situation is as follows:
We have a main application and a watcher application. Both are C++ applications, and both call the daemon(1,0) function.
Watcher checks whether the main application is running. If the main process is absent (crashed) or does not respond (the applications 'talk' to each other over TCP, and that is how Watcher knows that Main has hung), Watcher starts or restarts it.
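For illustration only, the TCP part of such a check could look roughly like the sketch below. The names peer_accepts_tcp and open_loopback_listener are hypothetical and are not taken from the real applications. The sketch only tells whether something accepts connections on the port; detecting a hung process would additionally need an application-level reply, which is not shown here.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helper: returns true if something accepts a TCP connection
// on ip:port. A refused connection is treated as "peer not running".
bool peer_accepts_tcp(const char* ip, std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        close(fd);
        return false;
    }

    bool ok = connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
    close(fd);
    return ok;
}

// Hypothetical helper for trying the probe locally: opens a listening socket
// on 127.0.0.1 with a kernel-chosen port and stores that port in *port_out.
// Returns the socket descriptor, or -1 on failure.
int open_loopback_listener(std::uint16_t* port_out) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    socklen_t len = sizeof(addr);

    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(fd, 1) != 0 ||
        getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return fd;
}
```

Note that a probe like this succeeds as long as any process holds the listening socket, whichever process that is.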
The TCP settings for the connection can be changed by the user, which is done through the main app. After the change, Watcher must be restarted to load the new configuration. That is done from the main app.
As it is, it works fine.
1. On startup, the Main app DOES kill the existing Watcher process and runs it again. [This is correct]
2. The Watcher app DOES kill Main and runs it again. [This is correct]
BUT
- If I run Main, which in turn starts Watcher,
- then kill Main so that Watcher is left alone,
- Watcher sees that there is no Main anymore and starts it again.
- Main starts again, kills Watcher and tries to start it again...
- and at this point, some kind of nonsense happens. Watcher appears to start (I can see its TCP port being taken in the netstat output), but there is no process named Watcher.
Normally netstat shows
tcp 0 0 IP:TCP_PORT LISTEN Watcher
but now it shows
tcp 0 0 IP:TCP_PORT LISTEN Main
It is as if watcher is there, but inside the Main process.
I use scripts to run the applications. Watcher uses this script (runMain.sh):
#!/bin/sh
killall -9 Main
./Main
and runs it like this: system("./runMain.sh&");
Main uses this script (runWatcher.sh):
#!/bin/sh
killall -9 Watcher
./Watcher
and runs it like this: system("./runWatcher.sh&");
What am I doing wrong? How do I run them so that they can restart each other when needed and always start as separate processes?
So far I have also tried running the scripts with nohup; the result is the same.
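One thing that may be worth ruling out, as an assumption rather than a confirmed diagnosis: a process started through system() inherits every open file descriptor of its parent, listening sockets included, unless the descriptor is marked close-on-exec. A minimal sketch of marking a descriptor that way with fcntl is below; the helper names set_close_on_exec and is_close_on_exec are hypothetical.

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

// Marks fd close-on-exec so that programs started later via exec
// (including the shell that system() runs) do not inherit it.
// Returns false if fd is not a valid open descriptor.
bool set_close_on_exec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1) return false;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}

// Reports whether the close-on-exec flag is currently set on fd.
bool is_close_on_exec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    return flags != -1 && (flags & FD_CLOEXEC) != 0;
}
```

In this sketch the flag would be set on each listening socket right after socket() returns, before any call to system().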
EDIT 1:
Note: the PIDs here are just for clarity. In reality the PID is not 1, of course.
I run Main. netstat shows me:
tcp 0 0 192.168.0.1:7000 LISTEN (PID 1)Main
tcp 0 0 192.168.0.1:7001 LISTEN (PID 1)Main
Main starts the Watcher using the script. Now netstat shows me:
tcp 0 0 192.168.0.1:7000 LISTEN (PID 1)Main
tcp 0 0 192.168.0.1:7001 LISTEN (PID 1)Main
tcp 0 0 192.168.0.1:8000 LISTEN (PID 2)Watcher
Now I manually kill Main by doing killall -9 Main. Now netstat shows me:
tcp 0 0 192.168.0.1:7000 LISTEN (PID 2)Watcher
tcp 0 0 192.168.0.1:7001 LISTEN (PID 2)Watcher
tcp 0 0 192.168.0.1:8000 LISTEN (PID 2)Watcher
Notice the change in who owns the listening sockets now? How did that happen?
Watcher sees that Main is gone and so it starts it using the script file.
Main kills the Watcher on startup. Netstat shows:
tcp 0 0 192.168.0.1:7000 LISTEN (PID 3)Main
tcp 0 0 192.168.0.1:7001 LISTEN (PID 3)Main
tcp 0 0 192.168.0.1:8000 LISTEN (PID 3)Main
And that's it. Watcher never runs again.
I tried to debug in Eclipse; Watcher crashes without throwing anything right on the line daemon(1,0).
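The ownership change in the netstat output above would be consistent with plain descriptor inheritance, although that is an assumption and not something verified against the real Main and Watcher. The sketch below (the function name child_inherits_listener is hypothetical) shows the mechanism in isolation: a parent opens a listening socket, forks, and closes its own copy, yet the child still holds a working listening socket.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if a forked child still holds a listening socket that was
// opened by the parent, even after the parent has closed its own copy.
bool child_inherits_listener() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // kernel-chosen port, only for this demonstration

    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(fd, 1) != 0) {
        close(fd);
        return false;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return false;
    }
    if (pid == 0) {
        // Child: ask the kernel whether the inherited descriptor is still
        // a socket in the listening state.
        int listening = 0;
        socklen_t len = sizeof(listening);
        bool ok = getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN,
                             &listening, &len) == 0 && listening != 0;
        _exit(ok ? 0 : 1);
    }

    // Parent: closing (or dying) only drops the parent's reference;
    // the child's copy of the descriptor is unaffected.
    close(fd);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return false;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

If the same thing happens between Main and Watcher, the port stays bound after Main is killed and netstat attributes it to whichever process still holds the descriptor.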
"It is as if watcher is there, but inside the Main process." But surely you know already that this isn't likely (and given how you're starting things... system() starts a shell, which then starts a new shell for your script, which then does whatever it does). You need to challenge the other things you're certain of and, instead of assuming all is as it should be (since clearly it isn't), find positive proof of those things... such as Watcher starting and not then stopping. – mah
daemon() line. 1 more step and it crashes. – user1651105