14 votes

We've been running Google Cloud Run for a little over a month now and have noticed that we periodically have Cloud Run instances that simply fail with:

The request failed because the HTTP connection to the instance had an error.

This message is nearly always* preceded by the following message (those are the only two messages in the log):

This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.

* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.

A few things that may be of importance:

  • Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
  • We have received errors that we've exceeded the maximum memory, but we've since dialed back our usage to avoid that issue.
  • This message appears to occur shortly after the 30-second mark (e.g., at 32 or 35 seconds), and our timeout is set to 75 seconds.
Based on your details, I am guessing that your container is crashing. Add some code to log to Stackdriver, implement solid error handling (exceptions depending on the language), etc. The key item is that a new container is starting without a normal shutdown message before. This tells me that your container or maybe even Knative has crashed. - John Hanley
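
As a rough sketch of what that kind of error handling and logging could look like in Python (assuming a Flask app served by gunicorn, which matches the later comments; on Cloud Run, anything written to stdout/stderr ends up in Stackdriver):

    # Minimal sketch: log every unhandled exception instead of letting the
    # worker die silently. Assumes Flask; names here are illustrative.
    import logging
    import sys

    from flask import Flask

    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    app = Flask(__name__)

    @app.errorhandler(Exception)
    def log_unhandled(exc):
        # logging.exception records the full traceback, which is what you
        # want to see in Stackdriver when an instance dies.
        logging.exception("Unhandled exception while serving request")
        return "internal error", 500

    @app.route("/")
    def handle():
        logging.info("handling request")
        return "ok"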
@JohnHanley thanks for the suggestion. I believe this happens when the service has either previously scaled back to zero, or it is starting a new container due to increased concurrency. Also, this issue seldom happens; nevertheless, I'll see about adding additional error handling to our container. - bboe
When Cloud Run scales to zero, you will see a message. Double check to see if there is one. If the service is scaling to zero normally, then perhaps your container is taking too long to accept the first HTTP request (e.g., opening the port) and Cloud Run thinks your container has failed. 30 seconds is a long time to accept a request. This is not the same as the time it takes to respond to an HTTP request. - John Hanley
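
For what it's worth, one way to keep start-up fast in Python is to bind the port immediately and defer heavy initialization until the first request (a sketch, assuming Flask; the slow start-up work is simulated here):

    # Sketch: open the port right away so Cloud Run sees the container as
    # ready, and do expensive start-up work lazily on the first request.
    import os
    import time

    from flask import Flask

    app = Flask(__name__)
    _heavy_resource = None  # loaded lazily instead of at import time

    def get_resource():
        global _heavy_resource
        if _heavy_resource is None:
            time.sleep(20)  # stand-in for slow initialization (model load, cache warm-up, ...)
            _heavy_resource = object()
        return _heavy_resource

    @app.route("/")
    def handle():
        get_resource()
        return "ok"

    if __name__ == "__main__":
        # Cloud Run tells the container which port to listen on via $PORT.
        app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))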
@bboe - If additional logging doesn't uncover any error in your code, you may want to file a bug report in the public issue tracker: cloud.google.com/support/docs/issue-trackers (search in page for "run"). - Martin Omander
The one exception so far that has been logged to Stackdriver is OSError: [Errno 107] Transport endpoint is not connected, raised in gunicorn when calling accept. It looks like it might be a common issue with gunicorn: github.com/benoitc/gunicorn/issues/1913 - bboe
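
For reference, gunicorn settings along the lines of what Google's Cloud Run Python samples use look roughly like this (a gunicorn.conf.py sketch; the values are illustrative and not a confirmed fix for that gunicorn issue):

    # gunicorn.conf.py - sketch of a Cloud Run-friendly gunicorn configuration.
    import os

    # Cloud Run injects the port to listen on via the PORT environment variable.
    bind = "0.0.0.0:{}".format(os.environ.get("PORT", "8080"))

    # A single worker keeps memory use predictable with concurrency=1; threads
    # let that worker keep accepting connections while requests wait on I/O.
    workers = 1
    threads = 8

    # Disable gunicorn's own worker timeout and let Cloud Run's request timeout
    # (75 seconds in this case) decide when a request has run too long.
    timeout = 0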

2 Answers

3 votes

In my case, this error was always thrown 120 seconds after the request was received. It turned out that Node 12's default request timeout is 120 seconds. So if you are running a Node server, you can either change the default timeout or upgrade to Node 13, where the default timeout was removed: https://github.com/nodejs/node/pull/27558

0 votes

If your logs don't catch anything useful, the instance is most probably crashing because you are running CPU-heavy tasks. A mention of this can be found on the Google Issue Tracker:

A common cause for 503 errors on Cloud Run would be when requests use a lot of CPU and as the container is out of resources it is unable to process some requests