AWS RDS with Postgres : Is OOM killer configured

Question

We are running a load test against an application that hits a Postgres database.

During the test, we suddenly get an increase in error rate. After analysing the platform and application behaviour, we notice that:

CPU of the Postgres RDS instance is at 100%
Freeable memory drops on this same server

And in the postgres logs, we see:

2018-08-21 08:19:48 UTC::@:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed

After investigating and reading the documentation, it appears that one possibility is that the Linux OOM killer ran and killed the process.

But since we're on RDS, we cannot access the system logs (/var/log/messages) to confirm.

So can somebody:

confirm that the OOM killer really runs on AWS RDS for Postgres?
give us a way to check this?
give us a way to compute the maximum memory used by Postgres based on the number of connections?

I didn't find the answer here:

Smells like the OOM killer alright. Reduce max_connections and/or work_mem. — Laurenz Albe
Thanks, but what is the formula? I have 32 GB available on the RDS instance. Regards — Loc Ann
There is no formula. It is shared_buffers + all memory used by PostgreSQL backends. Each backend can use work_mem per sort, hash or bitmap operation, but there can be several such operations in a query. — Laurenz Albe
Thanks. In my case, work_mem=4MB, shared_buffers=7994MB, max_connections=3518, and the AWS monitoring dashboard shows 1000 connections. AFAIU, I would consume up to 12 GB. My box has 32 GiB, so I don't understand how I can reach the OOM limit. — Loc Ann
I don't understand either, but your max_connections is insanely high. If you need that many connections, use a connection pool. — Laurenz Albe
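
The arithmetic in the comments above can be sketched in a few lines. This is only a back-of-the-envelope estimate, not a formula (as noted, there isn't one): it assumes every backend runs a fixed number of work_mem-sized operations at once, and it ignores per-backend overhead and every other allocation. The class name, the method name and the opsPerBackend parameter are made up for illustration; the input values are the ones quoted in the comments.

```java
public class PgMemoryEstimate {

    /**
     * Rough estimate in MB: shared_buffers plus, for every backend,
     * work_mem multiplied by an ASSUMED number of concurrent
     * sort/hash/bitmap operations (opsPerBackend is hypothetical).
     */
    public static long estimateMb(long sharedBuffersMb, long workMemMb,
                                  long backends, long opsPerBackend) {
        return sharedBuffersMb + backends * workMemMb * opsPerBackend;
    }

    public static void main(String[] args) {
        long sharedBuffersMb = 7994; // shared_buffers from the comment
        long workMemMb = 4;          // work_mem from the comment

        // 1000 observed connections, one operation each
        System.out.println(estimateMb(sharedBuffersMb, workMemMb, 1000, 1)); // 11994
        // all max_connections=3518 in use, one operation each
        System.out.println(estimateMb(sharedBuffersMb, workMemMb, 3518, 1)); // 22066
        // 1000 connections, five operations per query (assumed)
        System.out.println(estimateMb(sharedBuffersMb, workMemMb, 1000, 5)); // 27994
    }
}
```

With one operation per backend, 1000 connections come to about 12 GB, which matches the figure in the comment. Filling all 3518 allowed connections gives about 22 GB, and a handful of concurrent sorts or hashes per query at 1000 connections already approaches the 32 GiB of the instance.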

Fabio Manzano · Accepted Answer · 2018-09-06T13:18:15

AWS maintains a page with best practices for their RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html

In terms of memory allocation, this is their recommendation:

An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class—to a class with more RAM—results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or ReadIOPS is reduced to a very small amount. For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.

This is also their recommendation for troubleshooting possible OS issues:

Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view the metrics for your DB instance using the console, or consume the Enhanced Monitoring JSON output from Amazon CloudWatch Logs in a monitoring system of your choice. For more information about Enhanced Monitoring, see Enhanced Monitoring

There are a lot of good recommendations there, including query tuning.

Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:

Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones.

EDIT: regarding your specific issue with PostgreSQL, check this Stack Exchange thread, where the cause was a long-lived connection with auto commit set to false:

We had a long connection with auto commit set to false:

connection.setAutoCommit(false)

During that time we were doing a lot of small queries and a few queries with a cursor:

statement.setFetchSize(SOME_FETCH_SIZE)

In JDBC you create a connection object, and from that connection you create statements. When you execute the statements you get a result set.

Now, every one of these objects needs to be closed, but if you close a statement, its result set is closed, and if you close the connection, all its statements and their result sets are closed.

We were used to short-lived queries with connections of their own, so we never closed statements, assuming the connection would handle things once it was closed.

The problem was now with this long transaction (~24 hours) which never closed the connection. The statements were never closed. Apparently, the statement object holds resources both on the server that runs the code and on the PostgreSQL database.

My best guess as to what resources are left in the DB is the things related to the cursor. The statements that used the cursor were never closed, so the result sets they returned were never closed either. This meant the database didn't free the relevant cursor resources, and since the cursor was over a huge table, it took a lot of RAM.
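
A minimal sketch of the fix implied above, assuming the standard java.sql API: try-with-resources closes the result set and the statement as soon as the block ends, even though the connection itself stays open. The class name CursorScan and the method countRows are hypothetical and only serve the illustration; the SQL text and fetch size are supplied by the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CursorScan {

    /**
     * Streams the rows of a query through a cursor and counts them.
     * The connection is left open for the caller; the statement and
     * its result set are closed before this method returns.
     */
    public static long countRows(Connection connection, String sql, int fetchSize)
            throws SQLException {
        // The PostgreSQL JDBC driver only fetches through a cursor
        // when auto commit is off and a fetch size is set.
        connection.setAutoCommit(false);

        long rows = 0;
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setFetchSize(fetchSize);
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    rows++;
                }
            } // resultSet.close() runs here
        }     // statement.close() runs here, releasing the cursor
        return rows;
    }
}
```

Because the resources are released per query, a connection that lives for hours no longer accumulates open cursors on the database side.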

Hope it helps!
