
I've been trying to get OpenShift's HAProxy scaling working with my NodeJS Express 4 app (it's essentially a REST API), but I haven't had much luck.

I'm using loader.io's stress testing tools with a mere 100 users/minute (ramping up from 0), which I'm sure NodeJS/Express should at least be able to handle. Granted, this does generate roughly 10-20k requests in 60 seconds, but still.

What happens once the requests start pounding the server is that CPU goes up, memory stays pretty solid, and HAProxy's log file lets me know that it's about to scale up.

It never does. HAProxy crashes before it can scale, then I lose the SSH connection to the OpenShift host. It comes back after a while, though.

At one point I did see that it was hitting the default 128 connection limit and trying to spin up another gear, but since the requests kept coming in, I'm guessing it just couldn't handle it?
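
For reference, a rough way to watch how close the app gets to that 128-session ceiling from inside Node (a diagnostic sketch; the OPENSHIFT_* variables are the platform's standard ones):

var express = require('express');
var app = express();

var server = app.listen(process.env.OPENSHIFT_NODEJS_PORT || 8080,
                        process.env.OPENSHIFT_NODEJS_IP || '127.0.0.1');

// Log the live connection count once a second; if it parks at 128,
// the gear is sitting right at HAProxy's session ceiling.
setInterval(function () {
  server.getConnections(function (err, count) {
    if (!err) console.log('open connections: ' + count);
  });
}, 1000);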

At first I thought it was due to using a small gear: I was running 'top' and saw the CPU load spike through the roof before I eventually got disconnected.

I deleted the app and switched to small.highcpu gears (which cost money per hour).

It still crashes when it's supposed to scale up (with fewer than 100 concurrent users).

The small.highcpu gear does do something different, though: after it restarts, it adds a new gear, but it does NOT scale down (even though all traffic has stopped), so I have to scale down manually.

If I leave the second gear up and stress test again with 100 users within 1 minute, HAProxy still goes down (memory usage and CPU seem to be OK) and I lose the SSH connection shortly afterwards. This time it does NOT come back up by itself, either. I also get the following errors in my NodeJS app:

{ [Error: socket hang up] code: 'ECONNRESET' }
{ [Error: socket hang up] code: 'ECONNRESET', sslError: undefined }

If I manually restart HAProxy after this (I kinda have to, since it's not coming up), I can see that the local-gear is down while the second gear is up, meaning that my NodeJS app crashed on the first gear but stayed online on the second gear.
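
Worth noting: in Node, an 'error' event on a socket or outbound request that has no listener gets thrown as an uncaught exception and takes the whole process down, so unhandled ECONNRESETs like the ones above could be exactly what kills the app on that gear. A minimal sketch of the kind of guard I mean (the target URL is just a placeholder):

var https = require('https');

// Without an 'error' listener, an ECONNRESET on this request is thrown
// as an uncaught exception and kills the whole process.
var req = https.get('https://example.com/some-api', function (res) {
  res.resume(); // drain the response so the socket is freed
});

req.on('error', function (err) {
  console.error('request failed:', err.code); // e.g. ECONNRESET
});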

Is this really intended behaviour? Should I be doing something differently when dealing with NodeJS and HAProxy?
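
One thing I'm wondering about is the health check: if HAProxy's Layer7 probe has to compete with the flood of real requests, it could time out even while the app is technically alive. A trivially cheap dedicated route might behave better; a sketch (/health is a path I made up, and HAProxy would have to be pointed at it):

var express = require('express');
var app = express();

// Hypothetical /health route: no I/O and no real work, so the only way
// it times out is if the event loop itself is blocked.
app.get('/health', function (req, res) {
  res.sendStatus(200);
});

app.listen(process.env.OPENSHIFT_NODEJS_PORT || 8080,
           process.env.OPENSHIFT_NODEJS_IP || '127.0.0.1');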

I really can't justify paying for a service such as this if I can't even handle 100 users/minute, since I'm certain that I will eventually peak far beyond 100.

UPDATE: Here's a loader.io graph/report, which kinda shows when HAProxy is giving up: http://ldr.io/1tV2iwj

UPDATE 2: I tried using Blitz instead of loader.io, just to be certain about when HAProxy goes crazy. Blitz ended up with 12k hits, 26k errors and 4k timeouts.

Additionally, HAProxy went down and seemed like it would never come back up. This time I decided to wait, and after a few minutes, the local-gear DID come back up. It didn't bring up any additional gears, though.

Here's what HAProxy was telling me while the Blitz test was running (before it crashed and I got disconnected):

==> app-root/logs/haproxy_ctld.log <==
I, [2014-10-13T07:14:48.857616 #74934]  INFO -- : add-gear - capacity: 143.75% gear_count: 1 sessions: 23 up_thresh: 90.0%

==> app-root/logs/haproxy.log <==
[WARNING] 285/071506 (74918) : Server express/local-gear is DOWN, reason: Layer7 timeout, check duration: 10002ms. 0 active and 0 backup servers left. 128 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/071506 (74918) : proxy 'express' has no server available!
[WARNING] 285/071511 (74918) : Server express/local-gear is DOWN for maintenance.
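
If I'm reading the ctld line right, capacity seems to be sessions divided by a per-gear ceiling of 16 (23 / 16 = 143.75%), so it really did decide to add a gear. But the Layer7 timeout means the app stopped answering HAProxy's health check for over 10 seconds first, so the only backend got marked DOWN before the new gear could help. A 10-second stall with normal memory would fit a blocked event loop; here's a rough probe for that (the interval and threshold are arbitrary):

// If the event loop is blocked, this timer fires late and the measured
// lag grows; anything in the seconds range would explain a 10002ms
// Layer7 check timeout.
var last = Date.now();
setInterval(function () {
  var lag = Date.now() - last - 100;
  if (lag > 50) console.log('event loop lag: ' + lag + 'ms');
  last = Date.now();
}, 100);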

UPDATE 3: I tried again with Blitz; this time HAProxy/NodeJS didn't come back up, but instead got stuck on the following line (I can still SSH in):

DEBUG: Sending SIGTERM to child...

There's not much of a pattern here, except that HAProxy isn't doing what it's supposed to be doing: scaling. I'm fairly confident that it's not my NodeJS app at fault here, as it's not reporting any errors (to the log file or to New Relic).
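
To be fair, though, a crash via an uncaught exception, or an outright OOM kill, wouldn't leave anything in the log. A last-resort handler like this sketch would at least leave a trace:

// Last-resort trace: log synchronously, then exit. If the gear is
// being OOM-killed instead, this never fires, which is itself a clue.
process.on('uncaughtException', function (err) {
  console.error('uncaught exception:', err.stack || err);
  process.exit(1);
});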


1 Answer


Your gear is running out of memory, and thus all of your processes are being killed. (that's why you are also getting kicked out of your ssh session.) When that happens, it could potentially put the haproxy configuration in a bad state, and if it does not automatically repair itself on a restart I would consider that to be a bug.