HTTP 503 errors from Cloud Run app in one GCP project but not the other

Question

The issue

I am running the same container (with similar resources) in two projects, production and staging. Both have custom domains set up with Cloudflare DNS and are in the same region. Container builds happen in a completely separate project, and IAM controls access to the images. All 5 services in both projects are configured with a concurrency of 80 and a 300-second timeout.

Everything was working fine three days ago, but since yesterday almost all Cloud Run services on staging (thankfully only staging) have been returning 503 at random, for most requests. Some of these services had not been deployed for a week. The same containers are running fine in the production project, with no issues.

Ruled out causes

Anything to do with Cloudflare (I tried the URL that Cloud Run provides, and it returns 503 too)
Anything to do with the build or containers (I tried the demo hello world container in Go, and it has the issue too)
Resources (I tried giving it 1 GB of RAM and 2 CPUs, but the problem persisted)
Deployment issues (I deployed multiple branches, which didn't help)
An issue in the code (I routed traffic to an old revision from 2-3 days ago, but the issue was still there)
A service-level issue (I used the same container to create a completely new service, and it also had the issue)

Possible causes

Something in Cloud Run or the Cloud Run load balancer
Maybe some environment variables, but that doesn't seem to be the issue either

Response Codes

I ran a quick check with vegeta (30 seconds at 10 rps) against the same container on staging and production, requesting a static file path. The responses are below:

Staging

Production

If anyone has any insights on this, it would help greatly.

Ahmet Alp Balkan · Accepted Answer · 2020-07-15T01:34:11

Based on your explanation, I cannot tell what's going on. You explained what doesn't work but didn't point out what does work (does your app run locally? are you able to run a hello world sample application?).

So I'll recommend some debugging tips.

If you're getting an HTTP 5xx status code, first check your application's logs. Is it printing ANY logs? Are there logs for the failing request? Is your application deployed with a "verbose" logging setting?
Try hitting your *.run.app domain directly. If it's not working, then it's not a domain, DNS, or Cloudflare issue. Try debugging and/or redeploying your app. Deploy something that works first. If the *.run.app domain works, then the issue is not in Cloud Run.
Make sure you aren't using Cloudflare in proxy mode (i.e. your DNS should point to Cloud Run, not Cloudflare), as there is currently a known issue with certificate issuance/renewals when domains are behind Cloudflare.

Beyond these, if a redeploy seems to solve your problem, maybe try redeploying. It is quite likely that some configuration recently diverged between the two projects.