2
votes

I'm getting some weird behaviour with my Windows slave being suspended after I rebooted the slave Azure VM (maintenance) and the suspended status not being removed when the slave is back up and running, and connected successfully to Jenkins.

I've replicated this 3 times - with each VM I've been installing extra software manually, which requires a reboot, but after the reboot any build job I schedule hangs with the message "Waiting for next available executor on windows", the slave node shows as online in Jenkins but has "(suspended)" next to it, then eventually the Azure VM Agents plugin just creates a new VM and runs the build on that.

Extract from logs:

May 25, 2018 3:47:29 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run SEVERE: I/O error in channel slavew823810 java.io.IOException: Unexpected termination of the channel at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77) Caused by: java.io.EOFException at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679) at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3154) at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862) at java.io.ObjectInputStream.(ObjectInputStream.java:358) at hudson.remoting.ObjectInputStreamEx.(ObjectInputStreamEx.java:48) at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

May 25, 2018 3:47:34 PM hudson.model.AsyncPeriodicWork$1 run INFO: Started Azure VM Agents Clean Task May 25, 2018 3:47:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: start May 25, 2018 3:47:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: Running clean with 5 minute timeout May 25, 2018 3:47:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs INFO: AzureVMAgentCleanUpTask: cleanVMs: node slavew56f680 blocked to cleanup May 25, 2018 3:47:34 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists INFO: AzureVMManagementServiceDelegate: virtualMachineExists: check for slavew823810 May 25, 2018 3:47:34 PM com.microsoft.rest.interceptors.LoggingInterceptor log INFO: <-- HTTP FAILED: java.net.SocketException: Connection reset May 25, 2018 3:47:34 PM okhttp3.internal.platform.Platform log INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path? May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists INFO: AzureVMManagementServiceDelegate: virtualMachineExists: slavew823810 exists May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments INFO: AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments INFO: AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask$2 call INFO: AzureVMAgentCleanUpTask: cleanVMs: shutting down slavew823810 May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMAgent shutdown INFO: AzureVMAgent: shutdown: Add suspended status for node slavew823810 May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMAgent shutdown INFO: AzureVMAgent: shutdown: shutting down agent slavew823810 May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate shutdownVirtualMachine INFO: AzureVMManagementServiceDelegate: shutdownVirtualMachine: called for slavew823810 May 25, 2018 3:47:35 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: end May 25, 2018 3:47:35 PM hudson.model.AsyncPeriodicWork$1 run INFO: Finished Azure VM Agents Clean Task. 389 ms May 25, 2018 3:47:35 PM okhttp3.internal.platform.Platform log INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path? May 25, 2018 3:49:09 PM hudson.model.AsyncPeriodicWork$1 run INFO: Started Azure VM Maintainer Pool Size May 25, 2018 3:49:09 PM hudson.model.AsyncPeriodicWork$1 run INFO: Finished Azure VM Maintainer Pool Size. 1 ms May 25, 2018 3:51:04 PM hudson.model.AsyncPeriodicWork$1 run INFO: Started Azure VM Verification Task May 25, 2018 3:51:04 PM com.microsoft.azure.vmagent.AzureVMCloudVerificationTask getVirtualMachineCount INFO: AzureVMCloudVerificationTask: getVirtualMachineCount: start May 25, 2018 3:51:04 PM com.microsoft.azure.vmagent.AzureVMCloudVerificationTask getVirtualMachineCount INFO: AzureVMCloudVerificationTask: getVirtualMachineCount: end, cloud Jenkins Azure Cloud has currently 2 vms May 25, 2018 3:51:04 PM hudson.model.AsyncPeriodicWork$1 run INFO: Finished Azure VM Verification Task. 39 ms May 25, 2018 3:52:34 PM hudson.model.AsyncPeriodicWork$1 run INFO: Started Azure VM Agents Clean Task May 25, 2018 3:52:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: start May 25, 2018 3:52:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: Running clean with 5 minute timeout May 25, 2018 3:52:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs INFO: AzureVMAgentCleanUpTask: cleanVMs: node slavew56f680 blocked to cleanup May 25, 2018 3:52:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs INFO: AzureVMAgentCleanUpTask: cleanVMs: node slavew823810 blocked to cleanup May 25, 2018 3:52:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments INFO: AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments May 25, 2018 3:52:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments INFO: AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments May 25, 2018 3:52:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: end May 25, 2018 3:52:34 PM hudson.model.AsyncPeriodicWork$1 run INFO: Finished Azure VM Agents Clean Task. 21 ms May 25, 2018 3:53:00 PM hudson.node_monitors.ResponseTimeMonitor$1 monitor WARNING: Making slavew56f680 offline because it’s not responding May 25, 2018 3:53:09 PM hudson.node_monitors.ResponseTimeMonitor$1 monitor WARNING: Making slavew56f680 offline because it’s not responding May 25, 2018 3:54:09 PM hudson.model.AsyncPeriodicWork$1 run INFO: Started Azure VM Maintainer Pool Size May 25, 2018 3:54:09 PM hudson.model.AsyncPeriodicWork$1 run INFO: Finished Azure VM Maintainer Pool Size. 1 ms May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: launch method called for agent slavew56f680 May 25, 2018 3:54:56 PM com.microsoft.rest.interceptors.LoggingInterceptor log INFO: <-- HTTP FAILED: java.net.SocketException: Connection reset May 25, 2018 3:54:56 PM okhttp3.internal.platform.Platform log INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path? May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate isVMAliveOrHealthy INFO: AzureVMManagementServiceDelegate: isVMAliveOrHealthy: status PowerState/running May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSsh INFO: AzureVMAgentSSHLauncher: connectToSsh: start May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSession INFO: AzureVMAgentSSHLauncher: getRemoteSession: getting remote session for user slavew to host slavew56f680.ukwest.cloudapp.azure.com:22 May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSession INFO: AzureVMAgentSSHLauncher: getRemoteSession: Got remote session for user slavew to host slavew56f680.ukwest.cloudapp.azure.com:22 May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSsh INFO: AzureVMAgentSSHLauncher: connectToSsh: Got remote connection May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: starting dir C:.azure-agent-init May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: checking for java runtime May 25, 2018 3:54:56 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: starting java -fullversion May 25, 2018 3:54:57 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully May 25, 2018 3:54:57 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: java runtime present, copying slave.jar to remote May 25, 2018 3:54:57 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemote INFO: AzureVMAgentSSHLauncher: copyFileToRemote: Initiating file transfer to slave.jar May 25, 2018 3:55:07 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemote INFO: AzureVMAgentSSHLauncher: copyFileToRemote: copied file Successfully to slave.jar May 25, 2018 3:55:07 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: launching agent: java -jar slave.jar May 25, 2018 3:55:07 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: Connected successfully May 25, 2018 3:55:13 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: launched agent successfully May 25, 2018 3:57:13 PM com.squareup.okhttp.internal.Platform$JdkWithJettyBootPlatform getSelectedProtocol INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path? May 25, 2018 3:57:18 PM com.squareup.okhttp.internal.Platform$JdkWithJettyBootPlatform getSelectedProtocol INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path? May 25, 2018 3:57:34 PM hudson.model.AsyncPeriodicWork$1 run INFO: Started Azure VM Agents Clean Task May 25, 2018 3:57:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: start May 25, 2018 3:57:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: Running clean with 5 minute timeout May 25, 2018 3:57:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanVMs INFO: AzureVMAgentCleanUpTask: cleanVMs: node slavew823810 blocked to cleanup May 25, 2018 3:57:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments INFO: AzureVMAgentCleanUpTask: cleanDeployments: Cleaning deployments May 25, 2018 3:57:34 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask cleanDeployments INFO: AzureVMAgentCleanUpTask: cleanDeployments: Done cleaning deployments May 25, 2018 3:57:34 PM com.microsoft.rest.interceptors.LoggingInterceptor log INFO: <-- HTTP FAILED: java.net.SocketException: Connection reset May 25, 2018 3:57:34 PM okhttp3.internal.platform.Platform log INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path? May 25, 2018 3:57:35 PM com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask execute INFO: AzureVMAgentCleanUpTask: execute: end May 25, 2018 3:57:35 PM hudson.model.AsyncPeriodicWork$1 run INFO: Finished Azure VM Agents Clean Task. 240 ms May 25, 2018 3:59:09 PM hudson.model.AsyncPeriodicWork$1 run INFO: Started Azure VM Maintainer Pool Size May 25, 2018 3:59:09 PM hudson.model.AsyncPeriodicWork$1 run INFO: Finished Azure VM Maintainer Pool Size. 0 ms May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMCloud provision INFO: AzureVMCloud: provision: start for label windows workLoad 1 May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMCloud provision INFO: AzureVMCloud: provision: checking for node reuse options May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMCloud provision INFO: AzureVMCloud: provision: agent computer eligible for reuse slavew823810 May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists INFO: AzureVMManagementServiceDelegate: virtualMachineExists: check for slavew823810 May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate virtualMachineExists INFO: AzureVMManagementServiceDelegate: virtualMachineExists: slavew823810 exists May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMCloud provision INFO: AzureVMCloud: provision: asynchronous provision finished, returning 1 planned node(s) May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMCloud$2 call INFO: Found existing node, starting VM slavew823810 May 25, 2018 3:59:15 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate startVirtualMachine INFO: AzureVMManagementServiceDelegate: startVirtualMachine: slavew823810 May 25, 2018 3:59:15 PM org.jenkinsci.plugins.cloudstats.CloudStatistics logTypeNotSupported INFO: No support for cloud-stats plugin by class hudson.slaves.NodeProvisioner$PlannedNode May 25, 2018 3:59:15 PM hudson.slaves.NodeProvisioner$StandardStrategyImpl apply INFO: Started provisioning slavew823810 from Jenkins Azure Cloud with 1 executors. Remaining excess workload: -0.282 May 25, 2018 3:59:24 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate setVirtualMachineDetails INFO: Azure agent details: nodeNameslavew823810 adminUserName=3b22afb5-97a8-4da1-9436-18a770b5c131 shutdownOnIdle=true retentionTimeInMin=0 labels=windows May 25, 2018 3:59:24 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: launch method called for agent slavew823810 May 25, 2018 3:59:48 PM com.microsoft.azure.vmagent.AzureVMManagementServiceDelegate isVMAliveOrHealthy INFO: AzureVMManagementServiceDelegate: isVMAliveOrHealthy: status PowerState/running May 25, 2018 3:59:48 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSsh INFO: AzureVMAgentSSHLauncher: connectToSsh: start May 25, 2018 3:59:48 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSession INFO: AzureVMAgentSSHLauncher: getRemoteSession: getting remote session for user slavew to host slavew823810.ukwest.cloudapp.azure.com:22 May 25, 2018 3:59:50 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher getRemoteSession INFO: AzureVMAgentSSHLauncher: getRemoteSession: Got remote session for user slavew to host slavew823810.ukwest.cloudapp.azure.com:22 May 25, 2018 3:59:50 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher connectToSsh INFO: AzureVMAgentSSHLauncher: connectToSsh: Got remote connection May 25, 2018 3:59:50 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: starting dir C:.azure-agent-init May 25, 2018 3:59:51 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully May 25, 2018 3:59:51 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: checking for java runtime May 25, 2018 3:59:51 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: starting java -fullversion May 25, 2018 3:59:52 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher executeRemoteCommand INFO: AzureVMAgentSSHLauncher: executeRemoteCommand: executed successfully May 25, 2018 3:59:52 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: java runtime present, copying slave.jar to remote May 25, 2018 3:59:52 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemote INFO: AzureVMAgentSSHLauncher: copyFileToRemote: Initiating file transfer to slave.jar May 25, 2018 4:00:02 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher copyFileToRemote INFO: AzureVMAgentSSHLauncher: copyFileToRemote: copied file Successfully to slave.jar May 25, 2018 4:00:02 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: launching agent: java -jar slave.jar May 25, 2018 4:00:02 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: Connected successfully May 25, 2018 4:00:13 PM com.microsoft.azure.vmagent.remote.AzureVMAgentSSHLauncher launch INFO: AzureVMAgentSSHLauncher: launch: launched agent successfully May 25, 2018 4:00:13 PM com.microsoft.azure.vmagent.AzureVMCloud$2 call INFO: Remove suspended status for node: slavew823810 May 25, 2018 4:00:15 PM hudson.slaves.NodeProvisioner$2 run INFO: slavew823810 provisioning successfully completed. We have now 3 computer(s) May 25, 2018 4:00:31 PM com.squareup.okhttp.internal.Platform$JdkWithJettyBootPlatform getSelectedProtocol INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path? May 25, 2018 4:00:44 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish INFO: tokio-utp/master #29 completed: FAILURE

So as you can see from the logs, at 3:47 I rebooted the slave node "slavew56f680" manually.

By 3:54 the slave node was back up & running so I clicked to launch agent, which connected successfully (but in the UI it still said suspended).

At 3:57 I clicked to to build a job, expecting the suspended status to be removed and the job run on this slave.

At 3:59 It moved onto another slave ("slavew823810"), put it online and ran the job on it.

In this particular example slavew823810 was the Windows slave that this happened to before, I just never deleted it and it started it back up. If I had deleted it then it would have created a new windows slave VM.

Anyone seen this before? Any idea how I can stop this happening or remove the suspended status?

Jenkins version is 2.107.3 My master (Linux) and slave (Windows) are both on Azure.

Thanks in advance, Stephen.

Update

After having my slaves powered down over the weekend I started them in Azure then once connected in Jenkins (but saying suspended) I started a build job - result was that after a long delay it created another Windows slave machine to run the job on. I now have 3 Windows slaves. I was hoping it would use one of the previous 2 machines and so I wouldn't be blocked as they had been updated with the extra software and patches I need.

2

2 Answers

3
votes

Try system groovy command (or similar for Azure VM):

Jenkins.instance.getNode('Node-Name').getComputer().setAcceptingTasks(true)

There seems to be no other convenient way to unsuspend.

We got similar predicament when setting our nodes to:

Availability: "Bring this agent on-line according to a schedule

then flipping them back to "Keep this slave on-line as much as possible" while they were still in the downtime window. Node was online but suspended.

Could not find any documentation on this or why there's no GUI toggle option like take off-line or Disconnect/Launch.

1
votes

For anyone else reading this, the "azure vm agents" plugin does a poor job of handling errors returned by azure. you're safest bet is to disconnect the agent, shut it down in azure, and then run the following from the jenkins script console

Jenkins.instance.getNode('').toComputer().setAcceptingTasks(false); Jenkins.instance.getNode('').toComputer().setEligibleForReuse(true);

this should put it back into offline/suspended mode