I am running Kubernetes jobs via cron. In some cases the jobs may fail and I want them to restart. I'm scheduling the jobs like this:
kubectl run collector-60053 --schedule="30 10 * * *" --image=gcr.io/myimage/collector --restart=OnFailure --command -- node collector.js
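The same schedule can be expressed as an explicit CronJob manifest, which makes the restart behavior easier to control (a sketch, assuming the name, image, and command from the command above; `batch/v1beta1` is the CronJob API group on 1.10.x clusters, and `backoffLimit` is shown with its default value):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: collector-60053
spec:
  schedule: "30 10 * * *"
  jobTemplate:
    spec:
      backoffLimit: 6              # retries before the Job is marked failed
      template:
        spec:
          restartPolicy: OnFailure  # restart the container in place on failure
          containers:
          - name: collector
            image: gcr.io/myimage/collector
            command: ["node", "collector.js"]
```

Apply it with `kubectl apply -f collector-60053.yaml`.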
I'm having a problem where some of these jobs are running and failing but the associated pods are disappearing, so I have no way to look at the logs and they are not restarting.
For example:
$ kubectl get jobs | grep 60053
NAME                         DESIRED   SUCCESSFUL   AGE
collector-60053-1546943400   1         0            1h
$ kubectl get pods -a | grep 60053
$ # nothing returned
This is on Google Kubernetes Engine (GKE) running 1.10.9-gke.5.
Any help would be much appreciated!
EDIT:
I discovered some more information. I have auto-scaling set up on my GCP cluster. I noticed that when nodes are removed, the pods that ran on them are also removed (along with their metadata). Is that expected behavior? Unfortunately this leaves me no easy way to look at the pod logs.
My theory is that as pods fail, CrashLoopBackOff kicks in, and eventually the auto-scaler decides the node is no longer needed (it doesn't see the backing-off pod as an active workload). At that point the node goes away, and the pods go with it. I don't think this is expected behavior with `--restart=OnFailure`, but I basically witnessed it by watching closely.
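If the autoscaler really is deleting the node out from under the failing pods, the scale-down decision should be visible in cluster events and in the autoscaler's status ConfigMap (a sketch; exact event wording and the presence of the ConfigMap vary by version):

```shell
# Recent cluster events, oldest first — look for ScaleDown / node-removal entries
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

# On clusters running the cluster autoscaler, its status is recorded in kube-system
kubectl describe configmap cluster-autoscaler-status -n kube-system
```

Separately, on GKE the containers' stdout/stderr is shipped to Stackdriver Logging, which survives node deletion, so the failed runs' logs may still be retrievable there even after the pods are gone.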
kubectl get pods --selector=job-name=collector-60053 --output=jsonpath='{.items..metadata.name}'
to get the pod name. Then: kubectl logs <pod_name>
– Mauro Baraldi