2
votes

I am following the Argo Workflows Getting Started documentation. Everything goes smoothly until I run the first sample workflow as described in "4. Run Sample Workflows". The workflow just gets stuck in the Pending state:

vagrant@master:~$ argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name:                hello-world-z4lbs
Namespace:           default
ServiceAccount:      default
Status:              Pending
Created:             Thu May 14 12:36:45 +0000 (now)

vagrant@master:~$ argo list
NAME                STATUS    AGE   DURATION   PRIORITY
hello-world-z4lbs   Pending   27m   0s         0

Here it was mentioned that taints on the master node may be the problem, so I untainted the master node:

vagrant@master:~$ kubectl taint nodes --all node-role.kubernetes.io/master-
node/master untainted
taint "node-role.kubernetes.io/master" not found
taint "node-role.kubernetes.io/master" not found

Then I deleted the pending workflow and resubmitted it, but it got stuck in the pending state again.
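
For reference, the delete and resubmit were done with the standard Argo CLI commands, roughly like this (using the workflow name from argo list above):

vagrant@master:~$ argo delete hello-world-z4lbs
vagrant@master:~$ argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml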

The details of the newly submitted workflow that is also stuck:

vagrant@master:~$ kubectl describe workflow hello-world-8kvmb
Name:         hello-world-8kvmb
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  argoproj.io/v1alpha1
Kind:         Workflow
Metadata:
  Creation Timestamp:  2020-05-14T13:57:44Z
  Generate Name:       hello-world-
  Generation:          1
  Managed Fields:
    API Version:  argoproj.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
      f:spec:
        .:
        f:arguments:
        f:entrypoint:
        f:templates:
      f:status:
        .:
        f:finishedAt:
        f:startedAt:
    Manager:         argo
    Operation:       Update
    Time:            2020-05-14T13:57:44Z
  Resource Version:  16780
  Self Link:         /apis/argoproj.io/v1alpha1/namespaces/default/workflows/hello-world-8kvmb
  UID:               aa82d005-b7ac-411f-9d0b-93f34876b673
Spec:
  Arguments:
  Entrypoint:  whalesay
  Templates:
    Arguments:
    Container:
      Args:
        hello world
      Command:
        cowsay
      Image:  docker/whalesay:latest
      Name:   
      Resources:
    Inputs:
    Metadata:
    Name:  whalesay
    Outputs:
Status:
  Finished At:  <nil>
  Started At:   <nil>
Events:         <none>

While trying to get the workflow-controller logs, I get the following error:

vagrant@master:~$ kubectl logs -n argo -l app=workflow-controller
Error from server (BadRequest): container "workflow-controller" in pod "workflow-controller-6c4787844c-lbksm" is waiting to start: ContainerCreating
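
A quick way to confirm the overall state of the Argo pods is to list them in the argo namespace (plain kubectl, no assumptions beyond the namespace used by the install manifest):

vagrant@master:~$ kubectl get pods -n argo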

The details for the corresponding workflow-controller pod:

vagrant@master:~$ kubectl -n argo describe pods/workflow-controller-6c4787844c-lbksm
Name:           workflow-controller-6c4787844c-lbksm
Namespace:      argo
Priority:       0
Node:           node-1/192.168.50.11
Start Time:     Thu, 14 May 2020 12:08:29 +0000
Labels:         app=workflow-controller
                pod-template-hash=6c4787844c
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/workflow-controller-6c4787844c
Containers:
  workflow-controller:
    Container ID:  
    Image:         argoproj/workflow-controller:v2.8.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      workflow-controller
    Args:
      --configmap
      workflow-controller-configmap
      --executor-image
      argoproj/argoexec:v2.8.0
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from argo-token-pz4fd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  argo-token-pz4fd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  argo-token-pz4fd
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                      From             Message
  ----     ------                  ----                     ----             -------
  Normal   SandboxChanged          7m17s (x4739 over 112m)  kubelet, node-1  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  2m18s (x4950 over 112m)  kubelet, node-1  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1bd1fd11dfe677c749b4a1260c29c2f8cff0d55de113d154a822e68b41f9438e" network for pod "workflow-controller-6c4787844c-lbksm": networkPlugin cni failed to set up pod "workflow-controller-6c4787844c-lbksm_argo" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
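
The sandbox error points to a missing /var/lib/calico/nodename file on node-1. Assuming the Vagrant machine name matches the hostname, this can be verified from the host with something like:

vagrant ssh node-1 -c "ls /var/lib/calico"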

I am running Argo 2.8:

vagrant@master:~$ argo version
argo: v2.8.0
  BuildDate: 2020-05-11T22:55:16Z
  GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
  GitTreeState: clean
  GitTag: v2.8.0
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64

I have checked the cluster status and it looks OK:

vagrant@master:~$ kubectl get nodes
NAME     STATUS   ROLES    AGE   VERSION
master   Ready    master   95m   v1.18.2
node-1   Ready    <none>   92m   v1.18.2
node-2   Ready    <none>   92m   v1.18.2

As for the K8s cluster installation, I created the cluster using Vagrant as described here; the only differences are:

  • libvirt as the provider
  • newer version of Ubuntu: generic/ubuntu1804
  • newer version of Calico: v3.14

Any idea why the workflows get stuck in the pending state and how to fix it?

Can you add the workflow-controller logs as well as the result of kubectl describe workflow hello-world-z4lbs? – Michael Crenshaw
Also what version of Argo are you running? – Michael Crenshaw
@MichaelCrenshaw I've updated the question with the info. – SergiyKolesnikov
Perfect! So, workflow-controller is what sort of "ushers" the workflows through their steps. Can you kubectl describe the workflow-controller pod to see why it's stuck in ContainerCreating? – Michael Crenshaw
@MichaelCrenshaw I've googled the Calico error and it seems like a common one. Anyways it does not seem to be an Argo problem anymore. Thank you very much for the on-line debugging session! – SergiyKolesnikov

2 Answers

2
votes

Workflows start in the Pending state and then are moved through their steps by the workflow-controller pod (which is installed in the cluster as part of Argo).

The workflow-controller pod is stuck in ContainerCreating. kubectl describe po {workflow-controller pod} reveals a Calico-related network error.

As mentioned in the comments, it looks like a common Calico error. Once you clear that up, your hello-world workflow should execute just fine.
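
A quick way to check is to look at the Calico pods themselves; a sketch, assuming the standard Calico manifests, which label the node pods with k8s-app=calico-node:

kubectl -n kube-system get pods -l k8s-app=calico-node
kubectl -n kube-system logs -l k8s-app=calico-node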

Note from OP: Further debugging confirms the Calico problem (the calico-node pods are not in the Running state):

vagrant@master:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
argo          argo-server-84946785b-94bfs                0/1     ContainerCreating   0          3h59m
argo          workflow-controller-6c4787844c-lbksm       0/1     ContainerCreating   0          3h59m
kube-system   calico-kube-controllers-74d45555dd-zhkp6   0/1     CrashLoopBackOff    56         3h59m
kube-system   calico-node-2n9kt                          0/1     CrashLoopBackOff    72         3h59m
kube-system   calico-node-b8sb8                          0/1     Running             70         3h56m
kube-system   calico-node-pslzs                          0/1     CrashLoopBackOff    67         3h56m
kube-system   coredns-66bff467f8-rmxsp                   0/1     ContainerCreating   0          3h59m
kube-system   coredns-66bff467f8-z4lbq                   0/1     ContainerCreating   0          3h59m
kube-system   etcd-master                                1/1     Running             2          3h59m
kube-system   kube-apiserver-master                      1/1     Running             2          3h59m
kube-system   kube-controller-manager-master             1/1     Running             2          3h59m
kube-system   kube-proxy-k59ks                           1/1     Running             2          3h59m
kube-system   kube-proxy-mn96x                           1/1     Running             1          3h56m
kube-system   kube-proxy-vxj8b                           1/1     Running             1          3h56m
kube-system   kube-scheduler-master                      1/1     Running             2          3h59m
0
votes

For the Calico CrashLoopBackOff: kubeadm uses the default interface, eth0, to bootstrap the cluster, but in this Vagrant setup eth0 is used by Vagrant itself (for SSH).
You can configure the kubelet to use the node's private IP address instead of eth0. You have to do that on each node and then run vagrant reload.

sudo vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

# Add the following Environment line to 10-kubeadm.conf and replace your_node_ip with the node's private IP
Environment="KUBELET_EXTRA_ARGS=--node-ip=your_node_ip"

Hope it helps