We run several Debian instances with PostgreSQL on Google compute engine and lately we have already seen several occurrences of the following problem.
Instance becomes suddenly non responsive. We cannot ssh it and we cannot connect to the database. Internal monitoring using telegraf is also not running during that period, no monitoring data collected.
Google monitoring of CPU activity shows very low usage during that period. GCP logs do not show any migration in fact do not show anything at all. Also all internal logs for instance - postgresql log, syslog, logs from periodical cronjobs - show the same gap. Looks like the instance was sort of frozen during that time. We so far noticed it only with PostgreSQL instances since these are heavily used.
Instances run these variants of OS and PG:
- Debian 9 with PG 11.9
- Debian 9 with PG 10.13
These incidents usually take 10-15 minutes, but in one case it was 1:20 hours. At the end of the incident some PG process is killed by an OOM killer but activity on the database immediately before the incident starts is usually relatively low, CPU usage and memory usage too. So it looks more like an instance has limited resources when it starts again? If it is even possible...
Any idea what could be the cause of these issues or what shall we look for? As I mentioned generally no info in internal logs on Debian during the period of the incident.
UPDATE: To avoid misunderstanding - instances in question are data warehouse database running on N1-highmen-8 machine (8 CPUs and 52 GB RAM) with 5 TB SSD. Or database collecting metrics from internet - custom machine 20 CPUs with 90 GB RAM and 3 TB SSD. All SW up to date.
UPDATE 2: Neither syslog, nor kern.log nor messages do not show anything for the time intervals during instance was non responsive. Immediately after incident telegraf recorded huge average load on CPUs but actually quite small CPU usage and Google monitoring shows very small CPU usage during the whole incident. Also immediately after the end of the incident always one of postgresql processes is killed by OOM killer causing database to go to the recovery mode.
As for PG work_mem parameter - instance collecting metrics (20 CPUs 90 GB RAM, 3 TB SSD) uses 8MB - it only inserts data but usually runs like 500 - 1000 connections.
Second instance is data warehouse analytical database and uses work_mem 128MB because lower numbers caused very bad query plans on majority of queries and usually runs only like 10 - 30 connections.
There was no unusual number of connections immediately before incidents happened on both databases.
UPDATE 3: Analytical database had today several small incidents of the same character. During the last one we stopped instance from GCP GUI and started it again after few minutes. Maybe it caused migration to the different HW. Since this operation instance is running OK.