13 votes

A Jenkins slave goes offline during the build. How can I fix this? I have seen a lot of related questions on SO and in the Jenkins issue tracker, but none of them provided a solution.

My configuration:

Jenkins version 1.651.1, Zuul version 2.1.1.dev393, with one Jenkins master (Ubuntu) and 2 slaves (Ubuntu), each with 16 GB of RAM, running builds in parallel.

The Jenkins master, devstack and both nodepool slaves are in the same IP range.

I'm facing an issue where, when one of the slaves completes its build, the java process on both slaves gets killed, so the other slave goes offline.

I found this by listing the processes running on the slaves and observed that the java process is killed simultaneously on both slaves when one slave has completed its build while the other is still running its build.

I had this issue previously and resolved it by switching from OpenJDK to Oracle's JDK. The slaves are now using Oracle Java 1.8.0_111, but we are getting the same issue with Oracle Java 8 as well.
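Since the java process on the slaves is being killed, it is worth checking whether the kernel's OOM killer is responsible before blaming the JDK. A minimal check, assuming Ubuntu's default log locations and the Jenkins 1.x remoting agent name (slave.jar):

    # on each slave: confirm the remoting agent is running while a build is active
    ps -ef | grep '[s]lave.jar'

    # look for OOM-killer activity around the time the slave dropped
    dmesg | grep -i 'killed process'
    sudo grep -i 'out of memory' /var/log/syslog

If the OOM killer shows up here, it points to a memory problem on the slave rather than a JDK problem.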

Build logs:

01:42:07 Slave went offline during the build
01:42:07 ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel
01:42:07    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
01:42:07 Caused by: java.io.EOFException
01:42:07    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
01:42:07    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2820)
01:42:07    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
01:42:07    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:302)
01:42:07    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
01:42:07    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
01:42:07    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
01:42:07 
01:42:07 Build step 'Execute shell' marked build as failure 
Did you look into the system messages log? Try to see if this issue and workaround are relevant to your case. – Fedor Losev
We saw this very regularly when the master got very busy. We then allocated more "CPU"s to it and have not seen it since (2 months so far). – Jayan
How are you running the master? Docker? What is the resource allocation for the master node? – Jayan

4 Answers

11 votes

A slave goes offline for one of these reasons:

  1. The jobs running on it consume more RAM than the slave has, or no memory is left. If this is the case, configure fewer executors on the slave or give the node more CPU/RAM.

  2. A slave cleanup process or some orphan process may be running in the background, which breaks the connection. Stop the cleanup process or kill the orphan process that is consuming the memory.

  3. The SSH keys between the master and the slaves may have changed. Send the SSH keys to the slaves again via scp and set the connection up once more.

Please try these steps (a short sketch of points 1 and 3 follows below) and also read the below articles for more help.
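A brief sketch of points 1 and 3 above, assuming the slave is reachable as slave1 and Jenkins connects as the user jenkins (both names are placeholders):

    # point 1: check memory pressure on the slave; if it is swapping,
    # reduce the number of executors or add RAM
    free -m

    # point 3: re-send the master's public key to the slave, then verify the login
    ssh-copy-id jenkins@slave1
    ssh jenkins@slave1 java -version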

1 vote

I had similar difficulty with Jenkins slave connections on Linux: they would either fail to start, or would drop instead of idling.

I discovered the problem was with the Linux shell, and the way it handled remote connections.

After much effort, my solution was:

  • Create a separate user for Jenkins on the master and slave machines.
  • Delete (rm) the ~/.bashrc files for these Jenkins users
  • Bounce the servers, done.

The existence of the bashrc files (even empty ones) corrupted the cluster. That was the only solution that would make the slaves federate in our environment. The docs did not cover this.
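As an illustration only, the steps above roughly correspond to the following on the master and on each slave (the user name jenkins and its home directory are assumptions, not part of the original answer):

    # create a dedicated Jenkins user
    sudo useradd -m -s /bin/bash jenkins

    # remove the .bashrc that interfered with the remoting channel
    sudo rm -f /home/jenkins/.bashrc

    # bounce the server
    sudo reboot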

You can imagine the "much effort" was basically bouncing the entire cluster with different combinations of bashrc files until finally just deleting them all in frustration.

The environment was CentOS and Jenkins CI integrated with IBM ClearCase.

Hopefully this solution might help shake something loose in your problem.

0 votes

I fixed this by assigning a static IP to my build node in the router's configuration. There may have been too many devices behind the router, and the node's IP address was being reassigned irregularly.

0 votes

I ran into the same problem, and finally found that it was caused by the Energy Saver configuration. After I checked "Prevent computer from sleeping automatically when the display is off" and unchecked "Put hard disks to sleep when possible", the problem was gone, for your information.
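For reference, roughly equivalent settings can be applied from the command line on macOS with pmset (a sketch; setting names can vary between macOS versions):

    # prevent the machine from sleeping automatically
    sudo pmset -a sleep 0

    # do not put hard disks to sleep when possible
    sudo pmset -a disksleep 0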