This morning I learned about the (unfortunate) default in kubernetes that the Job instances from all previously run cronjobs are retained in the cluster. Mea culpa for not reading that detail in the documentation. I also notice that deleting jobs (kubectl delete job [<foo> or --all]) takes quite a long time. Further, I noticed that even a reasonably provisioned kubernetes cluster with three large nodes appears to fail (I get timeouts of all sorts when trying to use kubectl) when there are just ~750 such old jobs in the system (plus some other active containers that otherwise had not entailed heavy load) [Correction: there were also ~7k pods associated with those old jobs that were retained :-o]. (I did learn about the configuration settings to limit/avoid storing old jobs from cronjobs, so this won't be a problem [for me] in the future.)
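For reference, this is roughly what those retention settings look like on a CronJob spec (the name, schedule, and container below are placeholders, not my actual config, and depending on your cluster version the apiVersion may still be batch/v1beta1): successfulJobsHistoryLimit and failedJobsHistoryLimit cap how many finished Jobs the cronjob keeps, and ttlSecondsAfterFinished lets finished Jobs (and their pods) be garbage-collected after a delay.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob            # placeholder name
spec:
  schedule: "*/5 * * * *"          # placeholder schedule
  successfulJobsHistoryLimit: 1    # keep at most 1 completed Job
  failedJobsHistoryLimit: 1        # keep at most 1 failed Job
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600  # delete finished Jobs (and their pods) after an hour
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo hello"]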

So, since I couldn't find documentation for kubernetes about this, my (related) questions are:

  1. what exactly is stored when kubernetes retains old jobs? (Presumably it's the associated pod's logs and some metadata, but this doesn't explain why they seemed to place such a load on the cluster.)
  2. is there a way to see the resources (disk only, I assume, but maybe there is some other resource) that individual or collective old jobs are using? (A rough way to at least enumerate the retained objects is sketched after this list.)
  3. why does deleting a kubernetes job take on the order of a minute?
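As a starting point for question 2, the following only enumerates the retained objects and dumps what the API server stores for one of them; it does not give a per-job disk figure (the job name is a placeholder):

kubectl get jobs --all-namespaces --no-headers | wc -l
kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded --no-headers | wc -l
kubectl get job <job name> -o yaml   # shows the full stored object (spec, status, metadata)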

1 Answer

I don't know if k8s provides that kind of detail about how much disk space each job is consuming, but here is something you can try.

Try to find the pods associated with the job:

kubectl get pods --selector=job-name=<job name> --output=jsonpath='{.items..metadata.name}'

Once you know the pod, find the docker container associated with it:

kubectl describe pod <pod name>

In the above output, look for the Node and the Container ID. Then go to that node and look under the path /var/lib/docker/containers/<container id found above>; there you can do some investigation to find out what is wrong.
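Putting those steps together, something like this should work on a docker-based node (the job name, pod name, and container id are placeholders, and the path assumes docker's default storage location):

# Pods created by the job
kubectl get pods --selector=job-name=<job name> --output=jsonpath='{.items..metadata.name}'

# Node and container ID for one of those pods
kubectl describe pod <pod name> | grep -E 'Node:|Container ID:'

# On that node, see how much disk the container's directory (logs + metadata) uses
sudo du -sh /var/lib/docker/containers/<container id found above>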