We have been running Cowboy in production on Compute Engine machines on GCP, and we have started benchmarking and tuning our service to handle more reqs/sec (we are in adtech, so in our case that means bids/sec).

After isolating and handling many of the issues separately, we narrowed things down to Cowboy optimization. These are our current findings and limitations:

Cowboy setup

We are using Cowboy 2.5 with 200 acceptors and a max backlog of 1024.
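For readers who want to reproduce the setup, a listener with these settings could look roughly like the sketch below. The module name, listener name, route, handler module and port are all hypothetical; only the acceptor count and backlog come from the question. It uses the proplist form of the transport options, which Ranch 1.x accepts (newer Ranch releases prefer a map).

```erlang
-module(bid_listener).
-export([start/0]).

%% Hypothetical names: bid_listener, bid_http, bid_handler, "/bid", port 8080.
start() ->
    Dispatch = cowboy_router:compile([
        {'_', [{"/bid", bid_handler, []}]}
    ]),
    TransportOpts = [
        {port, 8080},            %% assumed port
        {num_acceptors, 200},    %% acceptor pool size from the question
        {backlog, 1024}          %% listen backlog from the question
    ],
    ProtocolOpts = #{env => #{dispatch => Dispatch}},
    cowboy:start_clear(bid_http, TransportOpts, ProtocolOpts).
```

The acceptor count only controls how many processes accept new connections. With keep-alive clients it should matter little once the connections are established.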

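init(Req0, _State) ->
    T1 = erlang:monotonic_time(),
    %% Keep the Req returned by read_body/1; the old Req must not be reused.
    {ok, _BRjson, Req} = cowboy_req:read_body(Req0),
    %% ---- rest of work goes here but is switched off for our test ----
    %% Simulate the internal response arriving after 60 ms.
    erlang:send_after(60, self(), {'RSP', x, no_workers}),
    {cowboy_loop, Req, #state{t1 = T1}, hibernate}.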
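The question only shows init/2. For context, here is a minimal sketch of a complete loop handler that would consume the 'RSP' message. The module name, the binary-body variant of the message and the 204/200 status codes are assumptions for illustration, not our production code.

```erlang
-module(bid_handler).
-behaviour(cowboy_loop).
-export([init/2, info/3]).

-record(state, {t1}).

init(Req0, _Opts) ->
    T1 = erlang:monotonic_time(),
    {ok, _Body, Req} = cowboy_req:read_body(Req0),
    %% Stand-in for the real work: a reply message arrives after 60 ms.
    erlang:send_after(60, self(), {'RSP', x, no_workers}),
    {cowboy_loop, Req, #state{t1 = T1}, hibernate}.

%% No bid available: answer with 204 No Content and stop the handler.
info({'RSP', _Id, no_workers}, Req0, State) ->
    Req = cowboy_req:reply(204, Req0),
    {stop, Req, State};
%% Assumed shape of a successful response: a ready-made JSON binary.
info({'RSP', _Id, Body}, Req0, State) when is_binary(Body) ->
    Headers = #{<<"content-type">> => <<"application/json">>},
    Req = cowboy_req:reply(200, Headers, Body, Req0),
    {stop, Req, State};
%% Anything else: keep waiting, hibernating again between messages.
info(_Other, Req, State) ->
    {ok, Req, State, hibernate}.
```

To test without hibernation, drop the `hibernate` atom from both return tuples, i.e. return `{cowboy_loop, Req, State}` and `{ok, Req, State}`.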

Erlang VM

OTP 21

VM args: -smp auto +P 134217727 +K true +A 64 -rate 1200 +stbt db +scl false +sfwi 500 +spp true +zdbbl 8092

Load

JSON requests, ~4KB in size. Testing is done with JMeter from a separate machine on the same internal network (no SSL). All requests are POST with keep-alive.

Servers

GCP Compute Engine, 10 vCPU cores and 14GB RAM (current setup; previously tested with 4 vCPUs)

Findings

We are able to reach ~1900 reqs/sec, but all CPU cores in htop show almost 80% utilization.

At 1000 reqs/sec we see CPU utilization of 45-50% per core (still high, bearing in mind that no other part of our application is running).

*Note: on the 4 vCPU machine we were able to get close to 700 reqs/sec. In all of our tests memory is barely utilized and does not change with load.
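One caveat on the htop numbers: Erlang schedulers can busy-wait for a short while before going to sleep, so OS-level CPU figures may overstate the real work being done. A rough way to cross-check from inside the VM is to sample scheduler wall time. The sketch below uses a hypothetical module name and an arbitrary sampling interval.

```erlang
-module(sched_util).
-export([sample/1]).

%% Returns [{SchedulerId, Utilization}] with Utilization between 0.0 and 1.0,
%% measured over IntervalMs milliseconds (IntervalMs must be > 0).
sample(IntervalMs) ->
    %% Wall-time accounting is off by default and has to be enabled first.
    erlang:system_flag(scheduler_wall_time, true),
    T0 = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(IntervalMs),
    T1 = lists:sort(erlang:statistics(scheduler_wall_time)),
    %% Each entry is {Id, ActiveTime, TotalTime}; compare the two snapshots.
    [{Id, (A1 - A0) / (W1 - W0)}
     || {{Id, A0, W0}, {Id, A1, W1}} <- lists:zip(T0, T1)].
```

Calling `sched_util:sample(1000).` from a remote shell while the load test is running gives per-scheduler utilization, which can then be compared with what htop reports.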


QUESTION: How can we improve Cowboy's performance in terms of CPU usage?

I think 200 acceptors is too many for a backlog of 1024. Also, I can't see the +Q option for erl. - Pouriya
I just tried +Q 65000 (the same number we have in ulimit) and lowered the acceptors to 50; that didn't change anything. - Halid
What's the point of erlang:send_after/3? And why do you want the process to hibernate for 60 ms? - Pouriya
This is an async Cowboy call: it receives the request and sends it on internally in the system, and another process has to send back the response within 100ms (the standard bid response SLA timeout in the industry). - Halid
Did you try this without hibernation? Did you increase Ranch's max connections? - Pouriya

1 Answer

First off, thanks @Pouriya for the suggestions. Discussing this back and forth made me go back and re-check one of my comments about the right tool for the job. PS: we are on GCP, so 72 cores is out of the question at this stage.

Cowboy is great, but it does add a bit of overhead in the critical path of each request: a feature for most users, but an issue in my case, since we don't need it.

We tested again with Elli (https://github.com/elli-lib/elli), this time with a proper testing setup, and it gave an improvement of up to 20%, which is exactly what we needed!

If anyone on the Cowboy/Ranch team has a way of drastically reducing the CPU overhead, we will gladly test it, since we still use Cowboy in our APIs, just not in the critical path.
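For anyone curious what the same request flow looks like on Elli, here is a minimal sketch rather than our actual handler. The module name, the "/bid" path, the message shapes and the status codes are assumptions. Elli runs handle/2 inside the connection process, so waiting for the internal response is a plain receive with a timeout.

```erlang
-module(bid_elli_handler).
-behaviour(elli_handler).
-export([handle/2, handle_event/3]).

handle(Req, _Args) ->
    handle(elli_request:method(Req), elli_request:path(Req), Req).

%% POST /bid: hand the body to the rest of the system (omitted here),
%% then wait up to 100 ms for a reply message sent to this process.
handle('POST', [<<"bid">>], Req) ->
    _Body = elli_request:body(Req),
    receive
        {'RSP', _Id, Bid} when is_binary(Bid) ->
            {200, [{<<"Content-Type">>, <<"application/json">>}], Bid};
        {'RSP', _Id, no_workers} ->
            {204, [], <<>>}
    after 100 ->
        %% SLA timeout reached: answer with no bid.
        {204, [], <<>>}
    end;
handle(_Method, _Path, _Req) ->
    {404, [], <<"Not Found">>}.

%% Elli reports lifecycle events here; this sketch ignores all of them.
handle_event(_Event, _Data, _Args) ->
    ok.
```

It would be started with something like `elli:start_link([{callback, bid_elli_handler}, {port, 8080}]).`, where the port is again an assumption.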