I am running a Flask/Gunicorn Python app on a Heroku Cedar dyno. The app returns JSON responses to its clients (it's an API server, really).

Once in a while clients get 0-byte responses. It's not me returning them, however. Here is a snippet of my app's log:

Mar 14 13:13:31 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 app[web.1] [2013-03-14 13:13:31 UTC] 10.104.41.136 apisrv - api_get_credits_balance(): session_token=[MASKED]

The first line above is me starting to handle the request.

Mar 14 13:13:31 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 app[web.1] [2013-03-14 13:13:31 UTC] 10.104.41.136 apisrv 1252148511 api_get_credits_balance(): returning [{'credits_balance': 0}]

The second line is me returning a value (to Flask -- it's a Flask "Response" object).

Mar 14 13:13:31 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 app[web.1] "10.104.41.136 - - [14/Mar/2013:13:13:31] "POST /get_credits_balance?session_token=MASKED HTTP/1.1" 200 22 "-" "Appcelerator Titanium/3.0.0.GA (iPhone/6.1.2; iPhone OS; en_US;)"

The third line is Gunicorn's, where you can see that Gunicorn got the 200 status and a 22-byte HTTP body ("200 22").

However, the client got 0 bytes. Here is the Heroku router log:

Mar 14 13:13:30 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 heroku[router] at=info method=POST path=/get_credits_balance?session_token=MASKED host=matchspot-apisrv.herokuapp.com fwd="66.87.116.128" dyno=web.1 queue=0 wait=0ms connect=1ms service=19ms status=200 bytes=0

Why does Gunicorn return 22 bytes, but Heroku sees 0, and indeed passes back 0 bytes to the client? Is this a Heroku bug?
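To see how often this happens, one option might be to scan the router log for lines that report status=200 together with bytes=0. The following is only a minimal sketch, assuming the router keeps the space-separated key=value format shown in the log line above; the function names `parse_router_line` and `is_empty_success` are hypothetical and not part of any Heroku tooling.

```python
import shlex


def parse_router_line(line):
    """Collect the key=value tokens of a router log line into a dict.

    Assumes the space-separated key=value layout seen in the router
    line quoted above; tokens without '=' (timestamp, source) are skipped.
    """
    fields = {}
    # shlex.split handles quoted values such as fwd="66.87.116.128".
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_empty_success(line):
    """Return True for a line that reports status=200 with bytes=0."""
    fields = parse_router_line(line)
    return fields.get("status") == "200" and fields.get("bytes") == "0"
```

Counting the lines for which `is_empty_success` is true, against the total number of router lines, would give a rough failure rate.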

Did you notice that the Heroku timestamp is before your process timestamp? Do you use gevent? Something is wrong with synchronisation, I think. - Tigra
And yet the timestamps show a 1 second difference, not 1 ms... I have not worked with Heroku, so these are only suggestions. Both 1 ms and 1999 ms can show up as a 1 second difference in the timestamps. A service time of 19 ms also seems too low to be true on a cloud service. So my point is that there is probably some kind of timeout, and on timeout Heroku serves an empty page instead of an error. This is a long shot, but maybe you should emulate a long request and see what happens. - Tigra
How helpful was Heroku when you contacted them with this (out of curiosity)? - orokusaki
Not very much so far. I approached them 10 days ago, and was told the Python guys would look at it first, and if they couldn't help me then the routing guys would have a look. 5 days later I was informed that the Python guys had passed this to the routing guys, and today I got an email from a "routing guy" saying he could not recreate it and asking for some more info. So yes, they are going through the right process, but it's taking forever. - Nitzan Shaked
Small update: this hasn't yet been resolved. I've been corresponding back and forth with Heroku support, and the best I can gather right now is that they haven't dismissed me with "it's on your end", and are trying to write a tool that will tcpdump-capture app traffic, for "debugging cases like this". - Nitzan Shaked

1 Answer

I know I may be considered a little off the wall here but there is another option.

We know that from time to time there is a bug that happens in transit. We know that there is not much we can do right now to stop the problem. If you are only providing the API then stop reading; however, if you write the client too, keep going.

The error is a known case, with a known cause. An empty return value means that something went wrong, even though the value was available and was fetched, calculated, whatever... My instinct as a developer would be to treat an empty result as an HTTP error and request that the data be resent. You could then track the resend requests and see how often this happens.

I would suggest (although you strike me as the kind of developer to think of this too) that you count the retries and set a sane limit after which you respond with "network error" to the user. My instinct would be to retry right away and then to wait a little while before retrying some more.
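The retry policy described above could be sketched roughly as follows. This is an illustration only, not the asker's client code: `fetch` is a hypothetical zero-argument callable that performs the request and returns the response body, and `fetch_with_retry`, `EmptyResponseError`, `max_attempts` and `base_delay` are names and values made up for the example. It also assumes the request is safe to repeat, which seems reasonable for a read such as get_credits_balance.

```python
import time


class EmptyResponseError(Exception):
    """Raised when every attempt came back with an empty body."""


def fetch_with_retry(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a non-empty body.

    The first retry happens right away; later retries wait
    base_delay, 2 * base_delay, ... seconds first.
    Returns (body, retries) so the caller can log how often
    a resend was needed.
    """
    for attempt in range(max_attempts):
        body = fetch()
        if body:
            return body, attempt
        # An empty body is treated like a transport error.
        # No wait after the first failure; back off after later ones,
        # but do not sleep once the last attempt has been used up.
        if 1 <= attempt < max_attempts - 1:
            sleep(base_delay * 2 ** (attempt - 1))
    raise EmptyResponseError(
        "empty response after %d attempts" % max_attempts
    )
```

The caller would show "network error" to the user when `EmptyResponseError` is raised, and could report the returned retry count to get a picture of how frequent the empty responses are.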

From what you describe, the first retry would probably pick up the data properly. Of course, this could mean keeping the results of older requests hanging about in a cache for a few minutes, or running the request a second time, depending on what seems most appropriate.
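If the "keep it in cache for a few minutes" route is chosen, the server side might look something like the sketch below. It assumes the client sends some unique identifier with each request and repeats it on a resend; that identifier, the `ResponseCache` class and the 300 second lifetime are all hypothetical choices for illustration, and a real deployment with more than one dyno would need a shared store rather than an in-process dict.

```python
import time


class ResponseCache:
    """Keep recently computed response bodies for a short time."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # request_id -> (expires_at, body)

    def put(self, request_id, body):
        """Remember the body computed for this request id."""
        self._entries[request_id] = (self._clock() + self._ttl, body)

    def get(self, request_id):
        """Return the cached body, or None if missing or expired."""
        entry = self._entries.get(request_id)
        if entry is None:
            return None
        expires_at, body = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on lookup.
            del self._entries[request_id]
            return None
        return body
```

On a resend the handler would check `get()` first and only recompute the result on a miss. For a cheap read it may be simpler to skip the cache and just run the request again.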

This would also route around any number of other point-to-point networking errors and leave the app far more robust even in the face of connectivity problems.

I know our instinct as developers is to fix the known fault, but sometimes it is better to work towards a system that is able to operate despite faults. That said, it never hurts to log errors and problems and try to fix them anyway.