1
votes

It's been almost 3 months since I switched my platform to Google Cloud (Compute Engine + Cloud SQL + Cloud Storage).

I am very happy with it, but from time to time I notice high latency on the Cloud SQL server. My Compute Engine VMs and my Cloud SQL instance are all in the same datacenter location (us-1).

Since my Java backend makes a lot of SQL queries to generate a server response, the response times may vary from 250-300ms (normal) up to 2s!

In the console, I notice absolutely nothing: no CPU peaks, no read/write peaks, no backup running, nothing. No alert. The last time it happened, it lasted for a few days, and then the response times suddenly became better than ever.

I am pretty sure Google is working on the infrastructure behind the scenes, but I have no way to confirm that.

So here are my questions:

  • Has anybody else ever noticed the same kind of problem?
  • It is really annoying for me because my web pages get very slow and I have absolutely no control over it. Plus, I lose a lot of time because I generally never first suspect a hardware problem or maintenance, but rather something that we introduced in our app. Is this normal, or do I have a problem with my SQL instance?
  • Is there anywhere I can get visibility into what Google is doing on the hardware? I know there are maintenance alerts, but for my zone the list always seems to be empty when this happens.

The only option I have for now is to wait, and that is really not acceptable.

1
Although your Cloud SQL instance and GCE instances are in the same region (us-central1), could you confirm that they are in the same zone (e.g., us-central1-f)? – Adrián
Yes, same zone (us-central1-a). I do get excellent performance most of the time; it only happens sometimes, like right now (while yesterday it was back to normal). – Christophe Fondacci

1 Answer

0
votes

I suspect that Google does some sort of IO throttling and that their algorithm is not very sophisticated. We have a build server which slows down to a crawl if we do more than two builds within an hour. A build that normally takes 15 minutes will run for more than an hour, and we usually terminate it and re-run it manually later. This question describes a similar problem and the recommended solution is to use larger volumes as they come with more IO allowance.
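
Since the console shows nothing during the slow periods, it may help to record query latency on the client side, so you can tell whether a slowdown comes from the database round trip or from something introduced in the app. Below is a minimal sketch in plain Java, assuming you can wrap your SQL calls in a helper. The `QueryTimer` class and its method names are hypothetical (not part of any Google or JDBC API), and the nearest-rank percentile is just one simple choice of summary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical helper that records how long each wrapped call takes. */
public class QueryTimer {
    // Durations in nanoseconds; the synchronized wrapper allows use from several request threads.
    private final List<Long> samples = Collections.synchronizedList(new ArrayList<>());

    /** Runs the call, records its wall-clock duration (even on failure), and returns its result. */
    public <T> T time(Callable<T> call) throws Exception {
        long start = System.nanoTime();
        try {
            return call.call();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    /** Adds one duration sample, in nanoseconds. */
    public void record(long nanos) {
        samples.add(nanos);
    }

    /** Number of samples recorded so far. */
    public int count() {
        return samples.size();
    }

    /** Nearest-rank percentile (p in (0, 100]) of the recorded durations, in milliseconds. */
    public double percentileMillis(double p) {
        List<Long> sorted;
        synchronized (samples) {
            sorted = new ArrayList<>(samples); // snapshot so sorting does not disturb writers
        }
        if (sorted.isEmpty()) {
            throw new IllegalStateException("no samples recorded");
        }
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        int index = Math.max(0, Math.min(sorted.size() - 1, rank - 1));
        return sorted.get(index) / 1_000_000.0;
    }
}
```

You could wrap a JDBC call as, for example, `timer.time(() -> statement.executeQuery(sql))` and log `percentileMillis(50)` and `percentileMillis(99)` periodically. If those numbers jump while the rest of the request time stays flat, the slowdown is most likely between your VM and the SQL instance rather than in your own code.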