A silent timeout when accessing a gunicorn-powered service. How to debug?

Question

I have this file:

cd /opt/webapps/deployed/landing-pages
# 1. Activate the virtualenv
source /home/ec2-user/.virtualenvs/landing-pages/bin/activate
# 2. Start gunicorn process as daemon
gunicorn trescloud_landing.wsgi:application --daemon --bind=127.0.0.1:8888 --pid=/opt/webapps/pid/landing-pages.pid --access-logfile=/opt/webapps/log/landing-pages.access.log --error-logfile=/opt/webapps/log/landing-pages.error.log
# 3. Deactivate the virtualenv
deactivate

When I run this file, I am in the project's base directory (files like manage.py are in the current directory), so the trescloud_landing/wsgi.py file can be found.

I have permission to write both the .access.log and .error.log files, and the .pid file. When I run it, two processes are created:

ec2-user 17171 0.3 0.5 214916 11740 ? S 23:28 0:00 /home/ec2-user/.virtualenvs/landing-pages/bin/python2.7 /home/ec2-user/.virtualenvs/landing-pages/bin/gunicorn trescloud_landing.wsgi:application --daemon --bind=127.0.0.1:8888 --pid=/opt/webapps/pid/landing-pages.pid --access-logfile=/opt/webapps/log/landing-pages.access.log --error-logfile=/opt/webapps/log/landing-pages.error.log

ec2-user 17176 4.8 1.0 235144 20556 ? R 23:28 0:00 /home/ec2-user/.virtualenvs/landing-pages/bin/python2.7 /home/ec2-user/.virtualenvs/landing-pages/bin/gunicorn trescloud_landing.wsgi:application --daemon --bind=127.0.0.1:8888 --pid=/opt/webapps/pid/landing-pages.pid --access-logfile=/opt/webapps/log/landing-pages.access.log --error-logfile=/opt/webapps/log/landing-pages.error.log

And when I consult netstat (sudo netstat -anp | grep 8888) I get something like this:

tcp 0 0 127.0.0.1:8888 0.0.0.0:* LISTEN 17171/python2.7

This appears to tell me that the server is up.

However, when I hit it with curl http://127.0.0.1:8888/ (or with a browser, but since it is behind nginx, that only adds output which gives me no additional information), the request hangs: it never returns, no error is raised, and no partial response is generated. Naturally, when I hit the URL with nginx in the middle (i.e. via the external link) I get a 504 response, since nginx handles timeouts as any decent proxy should.

Peeking at the error log, I get no meaningful information (only a [CRITICAL] WORKER TIMEOUT if I access via nginx). This is what I see:

2015-11-04 23:35:07 [17171] [CRITICAL] WORKER TIMEOUT (pid:17319)
2015-11-04 23:35:07 [17171] [INFO] 1 workers
2015-11-04 23:35:08 [17319] [INFO] Worker exiting (pid: 17319)
2015-11-04 23:35:08 [17171] [INFO] 1 workers
2015-11-04 23:35:08 [17326] [INFO] Booting worker with pid: 17326
2015-11-04 23:35:08 [17171] [INFO] 1 workers
2015-11-04 23:35:08 [17171] [INFO] 1 workers

Question:

What can be the cause of the error? How can I debug this server? Where do I check?

pip freeze:

dateutils==0.6.6
Django==1.8.4
django-cors-headers==1.1.0
django-xmail-ritual==0.0.11 (*)
djangorestframework==3.2.3 (*)
future==0.15.0
gunicorn==19.1.0
psycopg2==2.6.1
python-cantrips==0.7.1 (*)
python-dateutil==2.4.2
pytz==2015.4
six==1.9.0
wheel==0.24.0

(*) These packages work, since I use them in other production environments without timeouts. This application used to work, and these requirements were never changed.

Thanks :D.

PyCharm Professional has remote debugging through SSH (and a 30 day full evaluation). — Paulo Scardine

Luis Masuelli · Accepted Answer · 2015-11-16T16:30:16

I found the answer as follows:

1. Run it with runserver on the server. If it takes a long time to start, then you have somewhat heavy initialization code (perhaps a service, models meta-instantiation, ...; it is up to you to inspect the code in your application). Usually this will be slow in your local environment as well, but if it isn't (and you have the same database engine), check whether there is a local unversioned file on the server, and analyze its content.
2. If runserver starts well but takes a long time to serve a single request (or at least the first request), check whether your views or middlewares (custom middlewares, if any) are executing one-time initialization code.
3. If you have no problem with runserver, then check whether the application runs well under WSGI. You can emulate this by starting an interactive interpreter in the same virtualenv and current directory and running from myproject.wsgi import application. Perhaps you'll find a time bottleneck as I did. Sometimes Django WSGI apps take some time to bootstrap, and they do that on the first request they receive (actually, each time gunicorn needs to create a new worker).
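The WSGI import check can also be timed rather than eyeballed. The following is only a rough sketch: the helper name time_import and the script layout are made up for illustration, and it measures the import alone, so any work the application defers until its first request will not show up here.

```python
import importlib
import sys
import time


def time_import(module_path, attr="application"):
    """Import module_path and return (seconds_taken, the named attribute)."""
    start = time.time()
    module = importlib.import_module(module_path)
    elapsed = time.time() - start
    return elapsed, getattr(module, attr)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run from the project's base directory, inside the same virtualenv,
    # passing the WSGI module path (e.g. trescloud_landing.wsgi) as argument.
    elapsed, _app = time_import(sys.argv[1])
    print("Importing %s took %.1f seconds" % (sys.argv[1], elapsed))
```

If the reported time approaches gunicorn's 30-second default worker timeout, the bootstrap itself is the likely suspect.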

In my case, I was in scenario 3. I noticed that adding --timeout=45 (or perhaps 60) to the gunicorn launch command gives the workers more time to process a request. Otherwise, a worker is created, takes more than 30 seconds to load, is killed, is restarted, tries the same request, takes more than 30 seconds again... and you get an endless loop.
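The same options can be kept in a gunicorn configuration file, which is plain Python, instead of a long command line. This is a sketch under assumptions: the file name gunicorn_conf.py is hypothetical, the paths are copied from the question, and 60 is just one of the timeout values mentioned above.

```python
# gunicorn_conf.py (hypothetical file name)
# Settings mirrored from the command line shown in the question.
bind = "127.0.0.1:8888"
daemon = True
pidfile = "/opt/webapps/pid/landing-pages.pid"
accesslog = "/opt/webapps/log/landing-pages.access.log"
errorlog = "/opt/webapps/log/landing-pages.error.log"

# Seconds a worker may stay silent before the master kills and restarts it.
# gunicorn's default is 30; raised here so a slow bootstrap can finish.
timeout = 60
```

It would then be launched with gunicorn -c gunicorn_conf.py trescloud_landing.wsgi:application from the project's base directory.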
