We have added a dockerised build agent to our development Kubernetes cluster which we use to build our applications as part of our Azure Devops pipelines. We created our own image based on the deprecated Microsoft/vsts-agent-docker on Github.
The build agent uses Docker outside of Docker (DooD) to create images on our development cluster.
This agent was working well for a few days but then an error would occasionally occur on the docker commands in our build pipeline:
Error response from daemon: No such image: fooproject:ci-3284.2 /usr/local/bin/docker failed with return code: 1
We realised that the build agent was creating tons of images that weren't being removed. There were tons of images that were blocking up the build agent and there were missing images, which would explain the "no such image" error message.
By adding a step to our build pipelines with the following command we were able to get our build agent working again:
docker system prune -f -a
But of course this then removes all our images, and they must be built from scratch every time, which causes our builds to take an unnecessarily long time.
I'm sure this must be a solved problem but I haven't been able to locate any documentation on the normal strategy for dealing with a dockerised build agent becoming clogged over time. Being new to docker and kubernetes I may simply not know what I am looking for. What is the best practice for creating a dockerised build agent that stays clean and functional, while maintaining a cache?
EDIT: Some ideas:
- Create a build step that cleans up all but the latest image for the given pipeline (this might still clog the build server though).
- Have a cron job run that removes all the images every x days (this would result in slow builds the first time after the job is run, and could still clog the build server if it sees heavy usage.
- Clear all images nightly and run all builds outside of work hours. This way builds would run quickly during the day. However heavy usage could still clog the build server.
EDIT 2:
I found someone with a docker issue on Github that seems to be trying to do exactly the same thing as me. He came up with a solution which he described as follows:
I was exactly trying to figure out how to remove "old" images out of my automated build environment without removing my build dependencies. This means I can't just remove by age, because the nodejs image might not change for weeks, while my app builds can be worthless in literally minutes.
docker image rm $(docker image ls --filter reference=docker --quiet)
That little gem is exactly what I needed. I dropped my repository name in the reference variable (not the most self-explanatory.) Since I tag both the build number and latest the
docker image rm
command fails on the images I want to keep. I really don't like using daemon errors as a protection mechanism, but its effective.
Trying to follow these directions, I have applied the latest
tag to everything that is built during the process, and then run
docker image ls --filter reference=fooproject
If I try to remove these I get the following error:
Error response from daemon: conflict: unable to delete b870ec9c12cc (must be forced) - image is referenced in multiple repositories
Which prevents the latest one from being removed. However this is not exactly a clean way of doing this. There must be a better way?