"frozen" Google compute engine instance with PostgreSQL

Question

We run several Debian instances with PostgreSQL on Google compute engine and lately we have already seen several occurrences of the following problem.

Instance becomes suddenly non responsive. We cannot ssh it and we cannot connect to the database. Internal monitoring using telegraf is also not running during that period, no monitoring data collected.

Google monitoring of CPU activity shows very low usage during that period. GCP logs do not show any migration in fact do not show anything at all. Also all internal logs for instance - postgresql log, syslog, logs from periodical cronjobs - show the same gap. Looks like the instance was sort of frozen during that time. We so far noticed it only with PostgreSQL instances since these are heavily used.

Instances run these variants of OS and PG:

Debian 9 with PG 11.9
Debian 9 with PG 10.13

These incidents usually take 10-15 minutes, but in one case it was 1:20 hours. At the end of the incident some PG process is killed by an OOM killer but activity on the database immediately before the incident starts is usually relatively low, CPU usage and memory usage too. So it looks more like an instance has limited resources when it starts again? If it is even possible...

Any idea what could be the cause of these issues or what shall we look for? As I mentioned generally no info in internal logs on Debian during the period of the incident.

UPDATE: To avoid misunderstanding - instances in question are data warehouse database running on N1-highmen-8 machine (8 CPUs and 52 GB RAM) with 5 TB SSD. Or database collecting metrics from internet - custom machine 20 CPUs with 90 GB RAM and 3 TB SSD. All SW up to date.

UPDATE 2: Neither syslog, nor kern.log nor messages do not show anything for the time intervals during instance was non responsive. Immediately after incident telegraf recorded huge average load on CPUs but actually quite small CPU usage and Google monitoring shows very small CPU usage during the whole incident. Also immediately after the end of the incident always one of postgresql processes is killed by OOM killer causing database to go to the recovery mode.

As for PG work_mem parameter - instance collecting metrics (20 CPUs 90 GB RAM, 3 TB SSD) uses 8MB - it only inserts data but usually runs like 500 - 1000 connections.

Second instance is data warehouse analytical database and uses work_mem 128MB because lower numbers caused very bad query plans on majority of queries and usually runs only like 10 - 30 connections.

There was no unusual number of connections immediately before incidents happened on both databases.

UPDATE 3: Analytical database had today several small incidents of the same character. During the last one we stopped instance from GCP GUI and started it again after few minutes. Maybe it caused migration to the different HW. Since this operation instance is running OK.

Jose Luis Delgadillo Jose Luis Delgadillo · Accepted Answer · 2020-09-22T13:44:10

I experienced a similar issue but with a MySQL Instance in GCP, the first issue was related with the type of the VM instance I used, I had a f1-micro machine type on this VM Instance and suddenly I wasn’t able to access the ssh. As this type of VM Instance has only 0.6GB of memory, it became out of memory soon, I changed it to a e2-medium that is value by default and it resolved my problems this time. As the Instance was out of memory the services in the instance started to fail, it was the reason that I can't access my instance.

At another time I started again with similar issues, but this time, the problem was the disk, I only had 10 GB and there was a process filling my disk, when a partition was out of space, the instance started to fail again. I only resized my disk, now my instance disk is 20GB and is working fine.

Having said that, I suggest increasing your resources per your convenience to enhance your performance, because to have the problems you described is a good indicator that your existing machine type is not a good fit for your workloads you run on that instance.

If your situation is the same as mine, you could change the machine type to adjust your memory and you can follow the next steps for these tasks please visit the following link to get further information about it.

Changing a machine type

1.- Go to the VM Instances page.

2.- In the Name column, click your instance.

From the instance details page, complete the following steps:

a) Click the Stop button to stop the instance, if you have not stopped it yet.

b) After the instance stops, click the Edit button at the top of the page.

c) Under the Machine configuration section, select the machine type you want to use, or create a custom machine type to increase only the Memory.

d) Save your changes and start again your VM Instance.

You can resize your disk following this guide or with the following command:

gcloud compute disks resize DISK_NAME --size DISK_SIZE

Or with the Console:

Go to the Disks page to see a list of zonal persistent disks in your project.
Click the name of the disk that you want to resize.
On the disk details page, click Edit.
In the Size field, enter the new size for your disk.
Click Save to apply your changes to the disk.

After you resize the disk, you must resize the file system so that the operating system can access the additional space.

Note: Do not resize boot disks beyond 2 TB because this is the limit.

Edit1

You mentioned that the logs don’t show information about the issue when the instance is frozen. Did you try with the kernel logs? I think it could provide a wealth of diagnostic information about this issue.

For Debian, this logs should be in the following path:

/var/log/kern.log

Also the messages log could help

/var/log/messages

You can obtain more information about the logs in this link.

Also, I think it could be a PostgreSQL config problem, for example you could take a look at "work_mem", this parameter specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults is four megabytes (4MB).

You can consult this URL to get more information.

Also I have found a good article that explains how to configure the PostgreSQL for Data Warehouse Usage

Another option could be that the kernel process in charge of identifying memory that could be paged out. You could configure your process to check smaller chunks more often.

This link explains better this configuration.

Additionally, as far as I know a data warehouse server consumes a lot of resources, so it could be a good idea to check if your Instance has enough resources for your workload.

Edit2

I have found an article that describes a similar problem and it said that:

When you consume more memory than is available on your machine you can start to see out of out of memory errors within your Postgres logs, or in worse cases the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you’re running, where as the the OOM killer in Linux begins killing running processes which in some cases might even include Postgres itself.

And this is the recommendation they give.

When you see an out of memory error you either want to increase the overall RAM on the machine itself by upgrading to a larger instance OR you want to decrease the amount of memory that work_mem uses. Yes, you read that right: out-of-memory it’s better to decrease work_mem instead of increase since that is the amount of memory that can be consumed by each process and too many operations are leveraging up to that much memory.

You could see the complete explanation of this article “Configuring memory for Postgres” here, it may help you with this issue.

"frozen" Google compute engine instance with PostgreSQL

1 Answers

Edit1

Edit2