I'm trying to create a cluster on Google Container Engine to host:
- a PostgreSQL database with a persistent disk (image mdillon/postgis:latest)
- a custom nginx/php-fpm image hosting a Symfony2 PHP project
The Symfony2 project is contained in the Docker image with all its dependencies ("composer install" is run in the Dockerfile while building the image).
On start, the entrypoint generates the bootstrap cache, warms up the Symfony cache, then starts php-fpm and nginx.
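The entrypoint does roughly the following (a simplified sketch; the exact console commands and flags in my image may differ slightly):

#!/bin/sh
set -e

# warm up the Symfony cache (the bootstrap cache is generated in a similar way,
# via the SensioDistributionBundle build_bootstrap script)
php app/console cache:warmup --env=prod

# start php-fpm in the background, then nginx in the foreground
php-fpm -D
exec nginx -g 'daemon off;'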
I do the following to create the cluster:
Create a cluster with 1, 2 or more nodes, e.g.:
gcloud container clusters create cluster-standard-1 --disk-size 20 --machine-type n1-standard-1 --num-nodes 1 --scopes storage-rw
gcloud container clusters create cluster-micro --disk-size 20 --machine-type f1-micro --num-nodes 3 --scopes storage-rw
I have run many tests with different configurations.
Create the replication controllers & services:
kubectl create -f pggis-rc.yaml
kubectl create -f pggis-service.yaml
kubectl create -f app-rc.yaml
kubectl create -f app-service.yaml
The app-service exposes a load balancer.
There are only single-container pods, with replicas = 1.
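The manifests look roughly like this (a simplified sketch, not the exact files; the image name, labels and ports are placeholders; pggis-rc.yaml follows the same pattern with the mdillon/postgis image and a gcePersistentDisk volume for the data directory):

app-rc.yaml:

apiVersion: v1
kind: ReplicationController
metadata:
  name: app
spec:
  replicas: 1
  selector:
    app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app
        # placeholder image name
        image: gcr.io/my-project/symfony-app:latest
        ports:
        - containerPort: 80

app-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  type: LoadBalancer
  selector:
    app: app
  ports:
  - port: 80
    targetPort: 80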
It worked very well with a cluster of two g1-small nodes. This morning it even worked with a cluster of a single f1-micro node (yes!).
But most of the time, as soon as the pods are running and I try to access the application, the pods suddenly go back to "Pending" (they were in the "Running" state just before).
With f1-micro and g1-small nodes, I see the following message (on all nodes):
kubelet: page allocation failure: order:0, mode:0x120
or
docker: page allocation failure: order:0, mode:0x120
(depending on the node), followed by a kernel dump... so I thought it was a memory problem.
With an n1-standard-1 node, this message does not appear.
In all cases, it is followed by a lot of messages like this (continuously):
Oct 21 19:59:22 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: [Errno 104] Connection reset by peer
Oct 21 19:59:27 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:32 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:37 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:42 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:47 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:52 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 19:59:57 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 20:00:02 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 20:00:07 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Oct 21 20:00:12 gke-cluster-standard-1-2f01f811-node-k8m6 accounts-from-metadata: WARNING error while trying to update accounts: ''
Then the node stays in NotReady status (kubectl get nodes) and the pods stay in the Pending state.
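For reference, these are the standard commands I use to check the state (the node name is the one from the logs above):

kubectl get nodes
kubectl get pods
kubectl describe node gke-cluster-standard-1-2f01f811-node-k8m6
kubectl get events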
So I tried to delete the failed VM instance (after more than 20 minutes of unavailability), and a new instance came back with the same messages as above.
The only solution is to delete the whole cluster and create a new one, but after many tests (more than 5) I have not managed to get a stable, working application.
page allocation failure: order:0 definitely indicates the node is OOM (good explanation here), but that shouldn't bring down the node, much less the cluster. Are your rc & service manifests posted somewhere so I can try to reproduce? Also, what does "...and a new instance came back, with the same messages as above." mean? – Tim Allclair

I don't get the page allocation failure message with a standard-1 node, but I've had the same problem. – Romain
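If the nodes are indeed running out of memory, one thing worth trying (not part of the original setup; the values and image name below are placeholders) is declaring resource requests and limits on the containers, so the scheduler accounts for their memory:

# inside the pod template of app-rc.yaml
    spec:
      containers:
      - name: app
        image: gcr.io/my-project/symfony-app:latest
        resources:
          requests:
            memory: 256Mi
            cpu: 100m
          limits:
            memory: 512Mi
            cpu: 250m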