
I'm trying to tune my web service (WS) to support ~20k concurrent users.

No matter which configuration I change, I still get the same average response time of 6 seconds per endpoint, plus various 502/504 errors, once my tests reach 2k (two thousand) users.

WebService:

CloudFlare <--> Nginx <--> Gunicorn <--> Django/DRF <--> Memcache <--> Postgres

Here's what I tried:

  • Increase gunicorn workers from 4 to 10
  • Increase service(pod) instances from 3 to 10
  • Increase gunicorn worker timeout to 120
  • Increase Nginx proxy_pass timeout to 120
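
To put the worker and pod numbers above in perspective, here is a rough capacity estimate. It is only a sketch: it assumes gunicorn's default sync worker class (one request per worker at a time), which the question does not state, and the function names are made up for illustration.

```python
def max_in_flight(pods, workers_per_pod, threads_per_worker=1):
    """Upper bound on requests being processed at the same instant."""
    return pods * workers_per_pod * threads_per_worker


def max_throughput(in_flight, avg_response_seconds):
    """Requests per second sustainable at a given average response time."""
    return in_flight / avg_response_seconds


before = max_in_flight(pods=3, workers_per_pod=4)
after = max_in_flight(pods=10, workers_per_pod=10)
print(before, after)                        # 12 100

# At the observed 6 s average response time:
print(round(max_throughput(after, 6.0), 1))  # 16.7
```

Under those assumptions, even the larger setup processes at most 100 requests at once, so with 2k users most requests would be waiting in a queue in front of the workers rather than being served.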

Most endpoints hit the database once every 100 seconds; all other requests get their data from memcache.

Could anyone help by pointing out which configuration I should be changing?
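
The caching pattern described above works roughly like this. This is a minimal standard-library sketch, not the actual Django/memcache code: `TTLCache`, `load_products` and the fake clock are hypothetical stand-ins.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_set(self, key, loader, ttl):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit: no database work
        value = loader()             # cache miss: one database hit
        self._store[key] = (now + ttl, value)
        return value


# Fake clock and loader so the behaviour is easy to follow.
fake_now = [0.0]
db_hits = [0]


def load_products():
    db_hits[0] += 1
    return ["product-691", "product-729"]


cache = TTLCache(clock=lambda: fake_now[0])
for _ in range(1000):                # 1000 requests inside the 100 s window
    cache.get_or_set("products", load_products, ttl=100)
print(db_hits[0])                    # 1

fake_now[0] = 101.0                  # the entry has expired
cache.get_or_set("products", load_products, ttl=100)
print(db_hits[0])                    # 2
```

With a real shared cache and many concurrent workers, every request that arrives between expiry and the first refill can also miss and hit the database, so the moments around each expiry may be worth measuring separately.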

Where should I be looking for delays/bottlenecks?

Gunicorn workers are clearly timing out, which I don't understand, since there's no logic in the WS views. Each view should only be fetching a cached query result from memcache and returning it.

Nginx logs:

latforms/android HTTP/1.1", upstream: "http://10.0.1.17:9090/endpoints/platforms/android", host: "myhost.co"
2018/08/13 23:43:25 [error] 8893#8893: *2809163 upstream timed out (110: Connection timed out) while connecting to upstream, client: 200.211.198.133, server: myhost.co, request: "GET /endpoints/store/products/729 HTTP/1.1", upstream: "http://10.0.1.18:9090/endpoints/store/products/729", host: "myhost.co"
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/categories/?cat_pk=13081 HTTP/1.1" 200 1718 "-" "python-requests/2.18.4" 627 80.840 [production-service-api-80] 10.0.0.112:9090, 10.0.1.13:9090, 10.0.0.113:9090 0, 0, 11150 40.000, 40.000, 0.840 504, 504, 200
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/categories/?cat_pk=13081 HTTP/1.1" 200 1718 "-" "python-requests/2.18.4" 689 80.857 [production-service-api-80] 10.0.0.112:9090, 10.0.1.12:9090, 10.0.0.113:9090 0, 0, 11150 40.000, 40.000, 0.857 504, 504, 200
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/home/ HTTP/1.1" 200 10072 "-" "python-requests/2.18.4" 670 80.580 [production-service-api-80] 10.0.1.13:9090, 10.0.1.11:9090, 10.0.0.112:9090 0, 0, 66511 40.001, 40.002, 0.577 504, 504, 200
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/products/691/ HTTP/1.1" 200 703 "-" "python-requests/2.18.4" 646 80.486 [production-service-api-80] 10.0.1.8:9090, 10.0.1.13:9090, 10.0.1.12:9090 0, 0, 1968 40.000, 40.000, 0.486 504, 504, 200
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/products/5458 HTTP/1.1" 301 0 "-" "python-requests/2.18.4" 678 80.444 [production-service-api-80] 10.0.1.13:9090, 10.0.1.12:9090, 10.0.1.17:9090 0, 0, 0 40.000, 40.002, 0.442 504, 504, 301
....
90, 10.0.1.11:9090, 10.0.1.8:9090 0, 0, 1968 40.000, 40.000, 0.584 504, 504, 200
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/products/5458/ HTTP/1.1" 200 241 "-" "python-requests/2.18.4" 647 80.709 [production-service-api-80] 10.0.0.113:9090, 10.0.1.8:9090, 10.0.0.112:9090 0, 0, 327 40.001, 40.000, 0.708 504, 504, 200
--
2018/08/13 23:43:25 [error] 8766#8766: *2809243 upstream timed out (110: Connection timed out) while connecting to upstream, client: 200.211.198.133, server: myhost.co, request: "GET /endpoints/store/categories/?cat_pk=13081 HTTP/1.1", upstream: "http://10.0.1.13:9090/endpoints/store/categories/?cat_pk=13081", host: "myhost.co"
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/products/692 HTTP/1.1" 301 0 "-" "python-requests/2.18.4" 677 80.672 [production-service-api-80] 10.0.1.17:9090, 10.0.1.10:9090, 10.0.0.113:9090 0, 0, 0 40.001, 40.001, 0.670 504, 504, 301
200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] "GET /endpoints/store/products/4608/ HTTP/1.1" 200 553 "-" "python-requests/2.18.4" 647 80.591 [production-service-api-80] 10.0.1.11:9090, 10.0.1.17:9090, 10.0.1.8:9090 0, 0, 1090 40.000, 40.003, 0.588 504, 504, 200
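
The access-log lines above are easier to read once the trailing fields are split per upstream attempt. The sketch below assumes those fields follow the ingress-nginx default layout (request length, request time, upstream name, then comma-separated upstream addresses, response lengths, response times and statuses); that layout is an assumption, and `upstream_attempts` is a hypothetical helper.

```python
import re

# Trailing fields, assuming the ingress-nginx default layout:
# request_length request_time [upstream_name] addrs lengths times statuses
TAIL = re.compile(
    r"(?P<req_len>\d+) (?P<req_time>[\d.]+) \[(?P<name>[^\]]+)\] "
    r"(?P<addrs>[\d.]+:\d+(?:, [\d.]+:\d+)*) "
    r"(?P<lengths>\d+(?:, \d+)*) "
    r"(?P<times>[\d.]+(?:, [\d.]+)*) "
    r"(?P<statuses>\d+(?:, \d+)*)$"
)


def upstream_attempts(line):
    """Return (request_time, [(addr, seconds, status), ...]) or None."""
    m = TAIL.search(line.strip())
    if m is None:
        return None
    addrs = m["addrs"].split(", ")
    times = [float(t) for t in m["times"].split(", ")]
    statuses = [int(s) for s in m["statuses"].split(", ")]
    return float(m["req_time"]), list(zip(addrs, times, statuses))


line = (
    '200.211.198.133 - [200.211.198.133] - - [13/Aug/2018:23:43:25 +0000] '
    '"GET /endpoints/store/categories/?cat_pk=13081 HTTP/1.1" 200 1718 "-" '
    '"python-requests/2.18.4" 627 80.840 [production-service-api-80] '
    '10.0.0.112:9090, 10.0.1.13:9090, 10.0.0.113:9090 0, 0, 11150 '
    '40.000, 40.000, 0.840 504, 504, 200'
)

total, attempts = upstream_attempts(line)
for addr, seconds, status in attempts:
    print(addr, seconds, status)
# 10.0.0.112:9090 40.0 504
# 10.0.1.13:9090 40.0 504
# 10.0.0.113:9090 0.84 200
```

Read this way, the 80.84 s total appears to be two 40 s timeouts against two pods followed by a reply in under a second from the third. That matches the "while connecting to upstream" errors and suggests the time is lost before a worker accepts the connection, not inside the view.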

Gunicorn logs:

{"asctime": "2018-08-13 23:42:55,145", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:55 +0000] \"GET /endpoints/store/products/691/ HTTP/1.1\" 200 1968 \"-\" \"python-requests/2.18.4\""}
{"asctime": "2018-08-13 23:42:55,167", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:55 +0000] \"GET /endpoints/store/products/729 HTTP/1.1\" 301 - \"-\" \"python-requests/2.18.4\""}
[2018-08-13 23:42:55 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:36)
[2018-08-13 23:42:55 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:37)
[2018-08-13 23:42:55 +0000] [382] [INFO] Booting worker with pid: 382
[2018-08-13 23:42:55 +0000] [383] [INFO] Booting worker with pid: 383
{"asctime": "2018-08-13 23:42:55,403", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:55 +0000] \"GET /endpoints/store/products/691/ HTTP/1.1\" 200 1968 \"-\" \"python-requests/2.18.4\""}
....
{"asctime": "2018-08-13 23:42:55,184", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:55 +0000] \"GET /endpoints/store/categories/?cat_pk=13081 HTTP/1.1\" 200 11150 \"-\" \"python-requests/2.18.4\""}
{"asctime": "2018-08-13 23:42:55,262", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:55 +0000] \"GET /endpoints/platforms/android HTTP/1.1\" 200 48 \"-\" \"python-requests/2.18.4\""}
{"asctime": "2018-08-13 23:42:55,439", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:55 +0000] \"GET /endpoints/platforms/android HTTP/1.1\" 200 48 \"-\" \"python-requests/2.18.4\""}
--
[2018-08-13 23:42:56 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:31)
{"asctime": "2018-08-13 23:42:56,689", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:56 +0000] \"GET /endpoints/store/products/729/ HTTP/1.1\" 200 2163 \"-\" \"python-requests/2.18.4\""}
{"asctime": "2018-08-13 23:42:56,799", "name": "gunicorn.access", "levelname": "INFO", "message": "10.0.0.13 - - [13/Aug/2018:23:42:56 +0000] \"GET /endpoints/store/products/5458/ HTTP/1.1\" 200 327 \"-\" \"python-requests/2.18.4\""}

1 Answer


Why aren't you using uWSGI?

To improve performance, try the following:

  1. Reduce the number of database hits in your code
  2. Increase the gunicorn worker count
  3. Disable info-level logging for gunicorn and Nginx

If these changes don't help, you will need to change your setup or give your server more resources.
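
Points 2 and 3 could be expressed in a gunicorn config file along these lines. This is a sketch under assumptions, not a tested fix: the file name is hypothetical, port 9090 and the 120 s timeout are taken from the question, and the worker formula is the starting point suggested in the gunicorn documentation.

```python
# gunicorn.conf.py (hypothetical file name; pass it to gunicorn with -c)
import multiprocessing

# Port seen in the Nginx upstream addresses in the question.
bind = "0.0.0.0:9090"

# Common starting point: (2 x cores) + 1. Inside a container, cpu_count()
# may report the host's cores rather than the pod's CPU limit, so an
# explicit number can be the safer choice.
workers = multiprocessing.cpu_count() * 2 + 1

# Worker timeout already tried in the question.
timeout = 120

# Only log warnings and above, and switch the access log off.
loglevel = "warning"
accesslog = None
```

On the Nginx side, the equivalent of the logging change is the `access_log off;` directive.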