2
votes

The issue

I am running the same container (with similar resources) in two projects, production and staging. Both have custom domains set up with Cloudflare DNS and are in the same region. The container build happens in a completely separate project, and IAM controls access to the images. All 5 services in both projects are configured with a concurrency of 80 and a 300-second timeout.

Everything was working fine 3 days ago, but since yesterday almost all Cloud Run services on staging (thankfully only staging) have been returning 503 at random, for most requests. Some of these services had not even been deployed for a week. The same containers run fine in the production project, with no issues.

Ruled out causes

  • Cloudflare: the URL that Cloud Run provides returns 503 as well
  • The build or the containers: the demo Go hello world container has the same issue
  • Resources: I tried giving it 1 GB of RAM and 2 CPUs, but the problem persisted
  • Deployment: deploying multiple branches didn't help
  • The code: I routed traffic to a revision from 2-3 days ago, but the issue was still there
  • The service itself: I used the same container to create a completely new service, and it had the issue too

Possible causes

  • something in Cloud Run itself or in the Cloud Run load balancer
  • maybe some env vars, but that doesn't seem to be the issue either

Response Codes

I just ran a quick check with vegeta (30 seconds at 10 rps) against the same container on staging and production, requesting a static file path. The responses are below:

Staging

[image: responses for staging]

Production

[image: good responses for production]
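For anyone who wants to repeat this kind of comparison without vegeta, here is a minimal Python sketch of the same idea: send requests at a fixed pace and count the status codes. The function name `tally_status_codes` and its parameters are made up for illustration. Unlike vegeta, it sends requests one at a time, so slow responses lower the effective rate.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def tally_status_codes(url, rate=10, duration=30, timeout=10):
    """Send about `rate` GET requests per second for `duration` seconds
    and count the HTTP status codes that come back.

    Hypothetical helper for illustration; requests are sequential.
    """
    counts = Counter()
    interval = 1.0 / rate
    for _ in range(int(rate * duration)):
        started = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                counts[resp.status] += 1
        except urllib.error.HTTPError as err:
            # urllib raises 4xx/5xx responses as HTTPError; record the code.
            counts[err.code] += 1
        except (urllib.error.URLError, OSError):
            # Connection failures and timeouts carry no status code.
            counts["error"] += 1
        # Sleep off whatever is left of this request's time slot.
        remaining = interval - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return counts
```

Running it once against the staging URL and once against the production URL, with the same static file path, gives two tallies that can be compared side by side.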

If anyone has any insights on this it would help greatly.


2 Answers

2
votes

Based on your explanation, I cannot understand what's going on. You explained what doesn't work but didn't point out what works (does your app run locally? are you able to run a hello world sample application?)

So I'll recommend some debugging tips.

  • If you're getting an HTTP 5xx status code, first check your application's logs. Is it printing ANY logs? Are there logs of the request? Does your application have a "verbose" logging setting, and was it deployed with that setting enabled?

  • Try hitting your *.run.app domain directly. If it's not working, then it's not a domain, DNS, or Cloudflare issue. Try debugging and/or redeploying your app. Deploy something that works first. If the *.run.app domain works, then the issue is not in Cloud Run.

  • Make sure you aren't using Cloudflare in proxy mode (i.e. your DNS should point to Cloud Run, not Cloudflare), as there is currently a known issue with certificate issuance/renewal when domains are behind Cloudflare.

Beyond these, if a redeploy seems to solve your problem, try redeploying. It is quite likely that some configuration recently diverged between the two projects.
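One way to look for that kind of drift is to load each project's service settings into a dictionary and list the keys that differ. The sketch below is only an illustration: `diff_settings` is a made-up helper, and the example dictionaries are invented values, not output from any real tool.

```python
_MISSING = "<missing>"


def diff_settings(left, right, path=""):
    """Return a list of (path, left_value, right_value) tuples for every
    setting that differs between two nested dictionaries."""
    if isinstance(left, dict) and isinstance(right, dict):
        diffs = []
        for key in sorted(set(left) | set(right)):
            child = f"{path}.{key}" if path else str(key)
            diffs.extend(
                diff_settings(
                    left.get(key, _MISSING), right.get(key, _MISSING), child
                )
            )
        return diffs
    # Leaf values (or a dict on one side only): report if they differ.
    return [] if left == right else [(path, left, right)]


# Invented example values, for illustration only.
staging = {"concurrency": 80, "env": {"MODE": "staging", "DEBUG": "1"}}
production = {"concurrency": 80, "env": {"MODE": "production"}}

for setting, stg, prod in diff_settings(staging, production):
    print(f"{setting}: staging={stg!r} production={prod!r}")
```

Differences that are expected (such as a project name) can be ignored; anything else is a candidate for what changed.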

1
votes

See Cloud Run Troubleshooting

https://cloud.google.com/run/docs/troubleshooting

Do you see 503 errors under high load? The Cloud Run (fully managed) load balancer strives to distribute incoming requests over the necessary amount of container instances. However, if your container instances are using a lot of CPU to process requests, the container instances will not be able to process all of the requests, and some requests will be returned with a 503 error code.

To mitigate this, try lowering the concurrency. Start from concurrency = 1 and gradually increase it to find an acceptable value. Refer to Setting concurrency for more details.
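The tuning loop can be sketched roughly as follows. This is an assumption-laden outline, not an official procedure: `find_max_concurrency` and the `measure_error_rate` callback are hypothetical names, the callback is expected to redeploy the service with the given concurrency and return the observed fraction of failed requests, and doubling is just one way to "gradually increase" the value.

```python
def find_max_concurrency(measure_error_rate, max_concurrency=80, threshold=0.01):
    """Step concurrency up from 1 (doubling each time, capped at
    `max_concurrency`) and return the last value whose measured error
    rate stayed at or below `threshold`. Returns None if even
    concurrency = 1 exceeds the threshold.

    `measure_error_rate(concurrency)` is a hypothetical callback that
    returns the fraction of failed requests at that setting.
    """
    best = None
    concurrency = 1
    while True:
        if measure_error_rate(concurrency) > threshold:
            return best
        best = concurrency
        if concurrency >= max_concurrency:
            return best
        concurrency = min(concurrency * 2, max_concurrency)
```

If even concurrency = 1 returns errors, as `None` would indicate here, load is probably not the cause and the other debugging steps are a better use of time.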