The issue
I am running the same containers (with similar resources) in two projects: production and staging. Both have custom domains set up with Cloudflare DNS and are in the same region. The container builds happen in a completely separate project, and IAM controls access to the images. All 5 services in both projects are configured with a concurrency of 80 and a 300-second timeout.
Everything was working fine 3 days ago, but since yesterday almost all Cloud Run services on staging (thankfully only staging) have been returning 503s randomly, and for most requests. Some of these services had not even been deployed for a week. The same containers are running fine in the production project with no issues.
Ruled out causes
- Cloudflare: requests to the URL that Cloud Run itself provides also return 503
- Build or containers: the demo hello world container (Go) shows the same issue
- Resources: I tried giving the services 1 GB RAM and 2 CPUs, but the problem persisted
- Deployment: deploying multiple branches did not help
- Code: I routed traffic to an older revision (2-3 days old), but the issue was still there
- Service-level configuration: I used the same container to create a completely new service, and it had the issue too
Possible causes
- Something in Cloud Run itself or the Cloud Run load balancer
- Maybe some env vars, but that also doesn't seem to be the issue
Response Codes
I ran a quick check with vegeta (30 seconds at 10 rps) against the same container on staging and production, requesting a static file path. Below are the responses:
[Response code tables: Staging, Production]

If anyone has any insights on this, it would help greatly.
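
For anyone who wants to reproduce the response-code check without vegeta, here is a rough Python sketch of the same idea: send a fixed number of requests to one URL and count the status codes that come back. The function names `fetch_status` and `tally_status_codes` are made up for illustration, and the requests are sent one after another, so the effective rate will be somewhat below the nominal 10 rps. It is not a replacement for a real load tester.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code of a GET request, or 0 on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, including 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, connection reset, timeout, etc.
        return 0


def tally_status_codes(url, rps=10, duration_s=30,
                       fetch=fetch_status, sleep=time.sleep):
    """Send rps * duration_s sequential requests and count each status code.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without touching the network.
    """
    counts = Counter()
    interval = 1.0 / rps
    for _ in range(rps * duration_s):
        counts[fetch(url)] += 1
        sleep(interval)
    return counts
```

Calling `tally_status_codes(url)` once with the staging service URL and once with the production one gives two counters that can be compared side by side; a large share of 503 (or 0) on one side only points at the environment rather than the container.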