I am trying to create a cluster on Google Container Engine to host:

  • a PostgreSQL database with a persistent disk (image mdillon/postgis:latest)
  • a custom nginx/php-fpm image hosting a Symfony2 PHP project

The Symfony2 PHP project is contained in the Docker image with all its dependencies ("composer install" is run in the Dockerfile while building the image).

On start, the entrypoint generates the bootstrap cache, warms up the cache, then starts php-fpm and nginx.
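
As a side note, the entrypoint behaviour (cache generation, then php-fpm and nginx starting) can be reproduced by building and running the image locally before pushing it to the cluster; the image name below is illustrative:

# Build the image; "composer install" runs during this step
docker build -t symfony-nginx-fpm .

# Run it locally and check that the cache warm-up completes and nginx responds
docker run --rm -p 8080:80 symfony-nginx-fpm
curl -I http://localhost:8080/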

I do the following to create the cluster:

Create a cluster with 1, 2 or more nodes, e.g.:

gcloud container clusters create cluster-standard-1 --disk-size 20 --machine-type n1-standard-1 --num-nodes 1 --scopes storage-rw

gcloud container clusters create cluster-micro --disk-size 20 --machine-type f1-micro --num-nodes 3 --scopes storage-rw

I have run many tests with many different configurations.
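
After each cluster creation, a quick sanity check (a sketch only, using the first cluster name above; the zone may need to be passed with --zone) is to fetch the credentials and verify that all nodes report Ready:

# Point kubectl at the new cluster
gcloud container clusters get-credentials cluster-standard-1

# Every node should be in the Ready state before creating the controllers
kubectl get nodes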

Create the replication controllers & services:

kubectl create -f pggis-rc.yaml

kubectl create -f pggis-service.yaml

kubectl create -f app-rc.yaml

kubectl create -f app-service.yaml

The app-service exposes a load balancer.

There are only single-container pods, each with replicas = 1.
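
To check the deployment (assuming the default namespace), I look at the pods and at the external IP assigned to the load balancer:

# All pods should reach the Running state
kubectl get pods

# The app service should eventually list an external IP for the load balancer
kubectl get services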

It worked very well with a cluster of 2 g1-small nodes. This morning it even worked with a cluster of 1 f1-micro node.

But most of the time, as soon as the pods are running and I try to access the application, the pods suddenly go back to "Pending" (they were in the Running state just before).

With the f1-micro and g1-small machine types, I see the following message (on all nodes):

kubelet: page allocation failure: order:0, mode:0x120

or

docker: page allocation failure: order:0, mode:0x120

(depending on the node), followed by a kernel dump, so I thought it was a memory problem.

With an n1-standard-1 node, this message does not appear.
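
To confirm the memory pressure, the kernel log can be checked directly on an affected node (a sketch only; the instance name is illustrative and must be replaced by the real one):

# SSH into the node (instance name is illustrative)
gcloud compute ssh gke-cluster-micro-xxxx-node-yyyy

# Look for allocation failures and OOM killer activity in the kernel log
dmesg | grep -iE "page allocation failure|out of memory"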

In all cases, it is followed by a lot of messages like these (continuously):

Oct 21 19:59:22 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: [Errno 104] Connection reset by peer
Oct 21 19:59:27 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:32 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:37 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:42 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:47 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:52 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:57 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 20:00:02 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 20:00:07 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 20:00:12 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''

Then the node stays in NotReady status (kubectl get nodes) and the pods stay in the Pending state.
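
For completeness, more detail on the NotReady node and the Pending pods can be obtained with (the node name is the one from the log above):

# Shows the node's conditions (Ready/NotReady) and its recent events
kubectl describe node gke-cluster-standard-1-2f01f811-node-k8m6

# Cluster-wide events usually indicate why pods are stuck in Pending
kubectl get events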

So I tried deleting the failed VM instance (after more than 20 minutes of unavailability); a new instance came back automatically, but with the same messages as above.

The only solution is to delete the whole cluster and create a new one, but after many attempts (more than 5), I have not managed to get a working, running application.

page allocation failure: order:0 definitely indicates the node is OOM (good explanation here), but that shouldn't bring down the node, much less the cluster. Are your rc & service manifests posted somewhere so I can try to reproduce? Also, what does "...and a new instance came back, with the same messages as above." mean? – Tim Allclair
Hi, here is the gist: gist.github.com/rlamarche/bed2b32fe0edec1a62a4. Regarding "...a new instance came back, with the same messages as above.": when I deleted the VM instance, a new one was automatically created, but with the same problem. – Romain
Note that I didn't have the page allocation failure message with an n1-standard-1 node, but I had the same problem. – Romain
I have deployed my containers on a g1-small instance on Google Compute Engine and I didn't have any OOM problem. – Romain

1 Answer

It sounds like your application doesn't fit onto such a small machine type, so you should use a larger machine type.

Note that the system overhead is pretty high on the f1-micro machine type: even before you run your application, there are a number of system daemons running (docker, kubelet, kube-proxy) along with any cluster add-ons (DNS, kube-ui, logging, monitoring) that have been deployed.
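
For example (a sketch only, reusing the flags from the question; the exact machine type and node count are arbitrary choices), the cluster could be recreated on larger nodes with:

# Larger nodes leave headroom for the system daemons and cluster add-ons
gcloud container clusters create cluster-standard-2 --disk-size 20 --machine-type n1-standard-2 --num-nodes 2 --scopes storage-rw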