I am having no end of trouble getting rate limiting to work on nginx with Passenger/Rails.
Part of the confusion comes from distinguishing which aspects of the config work on a per-client basis and which are global limits.
I'm having trouble getting my head around the ideal setup for nginx's limit_req and limit_req_zone directives. The documentation seems to flip-flop between language that hints these are user-specific and language that hints they apply globally.
In the docs it is quite vague exactly how the limit_req_zone line works. Is this 'zone' global or per-user? Given the following line, am I right in the following conclusions:

    limit_req_zone $binary_remote_addr zone=update_requests:1m rate=20r/s;

- $binary_remote_addr represents a user's IP address
- This representation in particular is preferable because it takes up less space than $remote_addr? Why is this important or preferable?
- The 'zone' (in this case) is filled up with representations of their IP address...?
- 'rate' is the rate at which requests are allowed to leave the queue?
- This 'rate' and 'zone' - are they client-specific or global?
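
To make the conclusions above concrete, here is the same line with my current reading of each part written out as comments. This is my interpretation of the ngx_http_limit_req_module docs rather than a confirmed answer. As I read them, each key's state takes 64 bytes on 32-bit platforms and 128 bytes on 64-bit platforms, so a 1 MB zone holds very roughly 16,000 or 8,000 addresses. The second directive and its zone name `perserver` are made up purely for contrast.

```nginx
# Key:  $binary_remote_addr is the client address in binary form
#       (4 bytes for IPv4, 16 bytes for IPv6), so one state is kept per address.
# Zone: update_requests is a 1 MB shared memory area that stores those states.
# Rate: 20r/s is applied to each key separately, not to the zone as a whole.
limit_req_zone $binary_remote_addr zone=update_requests:1m rate=20r/s;

# Hypothetical contrast: a constant key puts every request into one bucket,
# which would make the limit global to the server instead of per client.
limit_req_zone $server_name zone=perserver:1m rate=20r/s;
```

If that reading is right, whether a limit is "per-user" or "global" depends entirely on the key, and the zone is just the storage for the per-key counters.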
I'm also unsure about the limit_req line, e.g. for this:

    limit_req zone=main_site burst=10 nodelay;

- Not entirely sure what burst means. The docs are vague here too. I guess this is a number of requests. Why number of requests, when the rest of the requests system uses this bizarre 'zone' system?
- 'burst' requests are per....what timeframe?
- 'nodelay', as far as I understand, is meant to serve a 503 error immediately if the client already has other requests in the queue, rather than making the request wait for the queue to drain. a) How long would it wait otherwise? b) Does this mean that the 'burst' setting is ignored in this case?
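
To pin down the burst/nodelay questions, these are the three variants as I currently understand them from the docs. It is a sketch of my reading, not a confirmed answer: the three location paths are hypothetical and exist only to put the variants side by side, while main_site is the zone from my config (rate=40r/s).

```nginx
# 1. No burst: any request arriving faster than the zone's rate is rejected.
location /strict/ {
    limit_req zone=main_site;
}

# 2. burst without nodelay: up to 10 requests above the rate are held back
#    and released at the zone's rate; anything beyond that is rejected.
location /queued/ {
    limit_req zone=main_site burst=10;
}

# 3. burst with nodelay: up to 10 requests above the rate are passed on
#    immediately; anything beyond that is rejected.
location /immediate/ {
    limit_req zone=main_site burst=10 nodelay;
}
```

If this is correct, 'burst' is not tied to a timeframe at all: it is a count of requests tolerated above the rate for a given key, and rejected requests receive a 503 unless limit_req_status says otherwise.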
Thanks.
Some background info in case anyone is really bored and wants to have a look at the config and general issues we're trying to resolve:
At the moment I have this (extract):

    limit_req_zone $binary_remote_addr zone=main_site:10m rate=40r/s;
    limit_req_zone $binary_remote_addr zone=update_requests:1m rate=20r/s;
    server {
        listen 80;
        server_name [removed];
        root [removed];
        include rtmp_proxy_settings;
        try_files $uri /system/maintenance.html @passenger;
        location @passenger {
            passenger_max_request_queue_size 0; # 256;
            limit_rate_after 2048k;
            limit_rate 512k;
            limit_req zone=main_site burst=10 nodelay;
            limit_conn addr 5;
            passenger_enabled on;
            passenger_min_instances 3;
        }
        location ~ ^/update_request {
            passenger_enabled on;
            limit_req zone=update_requests burst=5 nodelay;
        }
        gzip on;
        gzip_min_length 1000;
        gzip_proxied expired no-cache no-store private auth;
        gzip_types text/plain application/xml application/javascript text/javascript text/css;
        gzip_disable "msie6";
        gzip_http_version 1.1;
    }

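
The extract does not show where the `addr` zone used by `limit_conn addr 5;` is defined. For completeness, a definition of roughly this shape is assumed to exist at the http level; the 10m size is a placeholder, not the real value.

```nginx
# Assumed definition (not part of the extract above): the shared memory zone
# that "limit_conn addr 5;" refers to, keyed on the client address.
limit_conn_zone $binary_remote_addr zone=addr:10m;
```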
We have two zones defined:
a) "main_site", designed to catch everything
b) "update_requests", for the endpoint that JS on the client polls via AJAX for updated content when a timestamp in a small (cached) file changes
By its nature this tends to mean that we have fairly low traffic for 1 or 2 minutes, but then a massive spike when potentially 10,000 clients all hit the server at once for this updated content (served from the DB in a slightly different way depending on filters, access permissions, etc.).
We were finding that during times of heavy load the site was grinding to a halt when the CPU cores were maxed out. We had a few bugs in our updating code which meant that when the connection was dropped, the queries queued up and kept bogging the server down until we had to take the site down temporarily and force users to log out and refresh their browsers. Effectively we DDoS'd ourselves :P I think this was originally triggered by some connectivity issues on our hosting company's side, which caused a bunch of requests to queue up in users' browsers.
While we ironed out the bugs we warned clients that they might receive the odd 503 "heavy load" message or see the content not updating in a timely fashion. The original intent of the rate limiting was to ensure that the everyday pages of the site could continue to be navigated around even during heavy load, while rate limiting the updated content.
However, the main issue we are seeing now is that even after the bugs in the updating code have been (hopefully) ironed out, we can't quite strike a good balance on the rate limiting. Everything we set seems to generate an unhealthy number of 503 errors in the access logs whenever a new piece of content is added to the site (and pulled by our users all at once).
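
One shape the update limiting could take is sketched below. It is untested, and the zone names `update_per_ip` and `update_global`, the sizes, the 500r/s rate and the burst of 1000 are all placeholders. It assumes that the spike comes mostly from distinct IP addresses (in which case a per-address limit alone does not cap the total load) and that a 1 MB zone may be too small to track 10,000 addresses at once on a 64-bit server.

```nginx
# Per-client limit, keyed on the client address; sized with room for
# roughly 10,000 simultaneous clients.
limit_req_zone $binary_remote_addr zone=update_per_ip:10m rate=20r/s;

# Server-wide limit: a constant key means every request shares one bucket.
limit_req_zone $server_name zone=update_global:1m rate=500r/s;

server {
    # ... existing server settings ...

    location ~ ^/update_request {
        passenger_enabled on;

        # No "nodelay": requests above the rate are held back and released at
        # the zone's rate; only those beyond the burst are rejected.
        limit_req zone=update_per_ip burst=5;
        limit_req zone=update_global burst=1000;

        # Status code returned for rejected requests (503 is the default).
        limit_req_status 503;
    }
}
```

The idea would be to spread the spike over a couple of seconds instead of rejecting most of it outright, at the cost of holding those connections open while they wait.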
We are looking at various solutions here in terms of caching but ideally we would still like to be protected by some kind of rate limiting which doesn't affect users during day to day operations.