Improve Erlang Cowboy performance

Question

We have been using Cowboy in production on our Compute Engine machines on GCP, and we have started benchmarking and tuning our service so it can handle more requests/sec (in our case, since we are in adtech, bids/sec).

After isolating and handling many of the issues separately, we came down to Cowboy optimization. These are our current findings and limitations:

Cowboy setup

We are using Cowboy 2.5 with 200 acceptors and a max backlog of 1024.
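
For readers who want to reproduce the setup, a listener with these settings could be started roughly as below. This is a minimal sketch, not our actual code: the module name `bid_listener`, the listener name `bid_http`, the `/bid` route, the `bid_handler` module and port 8080 are all placeholders; only the 200 acceptors and the backlog of 1024 come from the description above.

```erlang
%% Hypothetical listener module. Names, route and port are assumptions;
%% only num_acceptors = 200 and backlog = 1024 are taken from the post.
-module(bid_listener).
-export([start/0]).

start() ->
    %% Route every host to a single (hypothetical) handler module.
    Dispatch = cowboy_router:compile([
        {'_', [{"/bid", bid_handler, []}]}
    ]),
    %% Transport options as a proplist: num_acceptors is a Ranch option,
    %% port and backlog are passed through to the listening socket.
    TransportOpts = [
        {port, 8080},
        {num_acceptors, 200},
        {backlog, 1024}
    ],
    ProtocolOpts = #{env => #{dispatch => Dispatch}},
    %% Plain TCP listener (no TLS), matching the test setup.
    cowboy:start_clear(bid_http, TransportOpts, ProtocolOpts).
```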

init(Req0, _State) ->
    T1 = erlang:monotonic_time(),
    %% Keep the Req returned by read_body/1 and pass it on from here.
    {ok, BRjson, Req} = cowboy_req:read_body(Req0),
    %% ---- rest of work goes here but is switched off for our test ----
    erlang:send_after(60, self(), {'RSP', x, no_workers}),
    {cowboy_loop, Req, #state{t1 = T1}, hibernate}.
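
The `init/2` above only covers the request side. In a Cowboy loop handler the message scheduled by `erlang:send_after/3` arrives in the `info/3` callback of the same module, so the other half might look roughly like the sketch below. The status codes, the JSON content type and the clause for a binary body are assumptions for illustration; the post only shows the `{'RSP', x, no_workers}` message.

```erlang
%% Sketch of the matching info/3 callback (same module as init/2).
%% Reply codes and the binary-body clause are assumptions, not from the post.
info({'RSP', _Id, no_workers}, Req, State) ->
    %% Nothing to bid: answer with an empty 204 and stop the handler.
    Req1 = cowboy_req:reply(204, Req),
    {stop, Req1, State};
info({'RSP', _Id, Body}, Req, State) when is_binary(Body) ->
    %% A worker produced a response body: send it back as JSON.
    Req1 = cowboy_req:reply(200,
        #{<<"content-type">> => <<"application/json">>}, Body, Req),
    {stop, Req1, State};
info(_Other, Req, State) ->
    %% Ignore unrelated messages and hibernate again.
    {ok, Req, State, hibernate}.
```

Returning `{ok, Req, State}` instead of the `hibernate` variant in the last clause keeps the process awake between messages, which is one way to compare the cost of hibernation under load.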

Erlang VM

OTP 21

VM args: -smp auto +P 134217727 +K true +A 64 -rate 1200 +stbt db +scl false +sfwi 500 +spp true +zdbbl 8092

Load

JSON requests of ~4 KB each. Testing is done from a separate machine on the same internal network (no SSL) using JMeter. All requests are POST with keep-alive.

Servers

GCP Compute Engine with 10 vCPU cores and 14 GB RAM (previously tested with 4 vCPUs)

Findings

We are able to reach ~1900 reqs/sec, but all CPU cores in htop show almost 80% utilization.

At 1000 reqs/sec we see CPU utilization of 45-50% per core (still high, bearing in mind that no other part of our application is running).

Note: using the 4 vCPU machine we were able to get close to 700 reqs/sec. Memory is barely utilized in any of our tests and does not change with load.

QUESTION: How can we improve Cowboy's performance in terms of CPU usage?

I think 200 acceptors are too many for a backlog of 1024. Also, I can't see the +Q option for erl. — Pouriya
I just tried +Q 65000 (the same number we have in ulimit) and lowered the acceptors to 50; that didn't change anything. — Halid
What's the point of erlang:send_after/3? And why do you want the process to be hibernated for 60 ms? — Pouriya
This is an async Cowboy call: it receives the request and sends it internally in the system, and another process has to send back the response within 100 ms (the standard bid response SLA timeout in the industry). — Halid
Did you try this without hibernation? Did you increase Ranch's max connections? — Pouriya

Halid Halid · Accepted Answer · 2018-12-09T13:29:05

First off, thanks @Pouriya for the suggestions. Discussing this back and forth made me go back and re-check one of my comments about the right tool for the job. PS: we are on GCP, so 72 cores would be out of the question at this stage.

Cowboy is great, but it does add a bit of overhead in the critical path of each request: a feature (or, in my case, an issue) that we do not need.

We tested again with Elli (https://github.com/elli-lib/elli), this time with a proper testing setup, and it provided an improvement of up to 20%, exactly what we needed!
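
For anyone who wants to try the same comparison, a minimal Elli callback module looks roughly like this. It is only a sketch of the general shape, not our production handler: the module name `bid_elli_handler`, the `/bid` path, the port and the empty 204 reply are placeholders.

```erlang
%% Hypothetical Elli callback module; names, path and replies are assumptions.
-module(bid_elli_handler).
-behaviour(elli_handler).
-export([handle/2, handle_event/3]).

%% Elli calls handle/2 for every request; dispatch on method and path.
handle(Req, _Args) ->
    handle(elli_request:method(Req), elli_request:path(Req), Req).

handle('POST', [<<"bid">>], Req) ->
    %% The full body is already available as a binary.
    _Body = elli_request:body(Req),
    %% ---- real work would go here; reply with an empty 204 for the test ----
    {204, [], <<>>};
handle(_Method, _Path, _Req) ->
    {404, [], <<"Not Found">>}.

%% Elli reports lifecycle and error events here; ignore them in this sketch.
handle_event(_Event, _Data, _Args) ->
    ok.
```

It can be started with `elli:start_link([{callback, bid_elli_handler}, {port, 8080}])`. Note that `handle/2` returns the response directly, so waiting for another process to produce the bid would have to happen inside the callback, for example with a `receive ... after` block.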

If anyone on the Cowboy/Ranch team has a way of drastically reducing the CPU overhead, we will gladly test it, since we still use Cowboy in our APIs, just not in the critical path.