Google Cloud Run concurrency limits + autoscaling clarifications

Question

Google Cloud Run allows a specified request concurrency limit per container. The subtext of the input field states: "When this concurrency number is reached, a new container instance is started." I have two clarifying questions:

Is there any way to set Cloud Run to anticipate the concurrency limit being reached, and spawn a new container a little before that happens to ensure that requests over the concurrency limit of Container 1 are seamlessly handled by Container 2 without the cold start time affecting the requests?
Imagine we have Maximum Instances set to 10, Concurrency set to 10, and there are currently 100 requests being processed (i.e. we've maxed out our capacity and cannot autoscale any more). What happens to the 101st request? Will it be queued up for some period of time, or will a 5XX be returned immediately?

John Hanley · Accepted Answer · 2021-01-16T18:28:57

Is there any way to set Cloud Run to anticipate the concurrency limit being reached, and spawn a new container a little before that happens to ensure that requests over the concurrency limit of Container 1 are seamlessly handled by Container 2 without the cold start time affecting the requests?

No. Cloud Run does not try to predict future traffic patterns.

Imagine we have Maximum Instances set to 10, Concurrency set to 10, and there are currently 100 requests being processed (i.e. we've maxed out our capacity and cannot autoscale any more). What happens to the 101st request? Will it be queued up for some period of time, or will a 5XX be returned immediately?

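HTTP Error 429 Too Many Requests will be returned, not a 5XX. It is not returned immediately, though: the request is queued first, as the documentation quoted below explains.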

[EDIT - Google Cloud documentation on request queuing]

Under normal circumstances, your revision scales out by creating new instances to handle incoming traffic load. But when you set a maximum instances limit, in some scenarios there will be insufficient instances to meet that traffic load. In that case, incoming requests queue for up to 60 seconds. During this 60 second window, if an instance finishes processing requests, it becomes available to process queued requests. If no instances become available during the 60 second window, the request fails with a 429 error code on Cloud Run (fully managed).

About maximum container instances
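
To make the numbers in the question concrete, and to show one way a caller might cope with the eventual 429, here is a minimal Python sketch. It is not part of any Cloud Run API: `max_in_flight` and `call_with_backoff` are hypothetical helper names, and `send` stands in for whatever function issues the HTTP request and returns its status code. The retry delays are arbitrary assumptions, not values recommended by Google.

```python
import time
from typing import Callable


def max_in_flight(max_instances: int, concurrency: int) -> int:
    """Upper bound on requests a revision can process at the same time."""
    return max_instances * concurrency


def call_with_backoff(
    send: Callable[[], int],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns something other than 429.

    send is a hypothetical callable that performs one request and
    returns the HTTP status code. The delay doubles after each 429.
    Returns the last status code seen.
    """
    status = 429
    for attempt in range(attempts):
        status = send()
        if status != 429:
            return status
        # Do not sleep after the final attempt; just give up.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status


if __name__ == "__main__":
    # Settings from the question: 10 instances x 10 concurrent requests.
    print(max_in_flight(10, 10))

    # Simulated service that rejects twice, then succeeds.
    responses = iter([429, 429, 200])
    print(call_with_backoff(lambda: next(responses), sleep=lambda _: None))
```

With the settings from the question, the revision handles at most 100 requests at once. Request 101 waits in the queue and receives a 429 only if no slot frees up within the 60 second window, so a client-side retry like this is a second line of defense rather than the first.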
