AWS RDS PostgreSQL: what's the promised value for PostgreSQL replication lag?

Question

I have a primary RDS instance with four replicas.

Primary Database: Postgres, 4 vCPU, 16GB RAM, us-west-2a
Replica1: Postgres, 4 vCPU, 16GB RAM, us-west-2a, 200G SSD (No traffic, just for testing)
Replica2: Postgres, 4 vCPU, 16GB RAM, us-west-2b, 200G SSD (No traffic, just for testing)
Replica3: Postgres, 2 vCPU, 8GB RAM, us-west-2b, 200G SSD (little traffic)
Replica4: Postgres, 2 vCPU, 8GB RAM, us-west-2b, 200G SSD (little traffic)

The lag between the primary and a read replica exceeds 16 seconds, sometimes 30 seconds, without any heavy IOPS.

I have spent a lot of effort digging into the root cause of the lag.

Here is the CloudWatch report for a replica without any traffic.

Assumption One: is it caused by IO credit?

Here is the report for the IO credit balance. It has been at 100% for the past six hours, so I don't think the lag is caused by IO.

Even though I don't think it's caused by IO, I decided to upgrade the database disk from gp2 to io1 with 3000 provisioned IOPS.

But it didn't help; the lag is still there.

There is no traffic on the replica, so it has nothing to do with the PostgreSQL parameter max_standby_streaming_delay or hot standby.

The traffic is always less than 1M/s.

I created two brand-new m5.large PostgreSQL instances to verify this assumption and used pgbench to benchmark them.

To my surprise, the lag varied from 0 to 24 seconds.

You may ask: why don't you post this problem to AWS?

I have asked this question on the AWS forum, but nobody has answered.

I feel cheated and would like to know the real value of replication lag from your experience.

Amazon Aurora provides an estimated value (under 100 ms) for the lag. Here is my benchmark report; the lag is under 25 ms.

When it comes to AWS RDS PostgreSQL:

Can anyone tell me what the normal value of AWS RDS PostgreSQL replication lag is in the wild?
What's the promised estimated value of the replication lag for AWS RDS PostgreSQL?

Adamantike · Accepted Answer · 2021-01-29T15:53:04

If no user transactions are occurring on the source DB instance, a PostgreSQL read replica reports a replication lag of up to five minutes.
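
To illustrate why an idle primary can show up as lag: a time-based lag figure is commonly derived on the replica from `now() - pg_last_xact_replay_timestamp()`, and that value keeps growing while there is nothing new to replay. Below is a minimal sketch, assuming a DB-API connection to the replica (for example from psycopg2) and PostgreSQL 10 or later for the `pg_last_wal_*` functions. The helper names `replica_lag_seconds` and `effective_lag` are made up for this example, and I have not verified that this is exactly how the CloudWatch number is produced.

```python
# Run against the REPLICA. Returns the time since the last replayed
# transaction and whether everything received has already been replayed.
LAG_SQL = """
SELECT
    EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
        AS seconds_since_last_replay,
    pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
        AS caught_up
"""


def replica_lag_seconds(conn):
    """Return (seconds_since_last_replay, caught_up) for a replica.

    `conn` is assumed to be any DB-API connection (hypothetical caller
    supplies it, e.g. psycopg2.connect(...) pointed at the replica).
    """
    cur = conn.cursor()
    try:
        cur.execute(LAG_SQL)
        seconds, caught_up = cur.fetchone()
    finally:
        cur.close()
    # seconds is NULL if no transaction has been replayed yet.
    seconds = None if seconds is None else float(seconds)
    return seconds, bool(caught_up)


def effective_lag(seconds, caught_up):
    """Treat a fully caught-up replica as zero lag, whatever the timestamp says."""
    if caught_up:
        return 0.0
    return seconds
```

If `caught_up` is true while `seconds_since_last_replay` is large, the replica is probably not behind; the primary simply has not committed anything recently.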

Can you check the replication lag while running a script that writes to the database every few milliseconds, as recommended in this answer?
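
A writer script of that kind could look roughly like the sketch below. It assumes a DB-API connection to the primary (for example from psycopg2) and PostgreSQL 9.5 or later for `ON CONFLICT`. The table name `replication_heartbeat` and the function `write_heartbeats` are hypothetical names chosen for this example, not something provided by RDS.

```python
import time

# One-row table that is overwritten on every beat, so it never grows.
HEARTBEAT_DDL = """
CREATE TABLE IF NOT EXISTS replication_heartbeat (
    id integer PRIMARY KEY,
    beat_at timestamptz NOT NULL
)
"""

HEARTBEAT_UPSERT = """
INSERT INTO replication_heartbeat (id, beat_at)
VALUES (1, clock_timestamp())
ON CONFLICT (id) DO UPDATE SET beat_at = EXCLUDED.beat_at
"""


def write_heartbeats(conn, interval_s=0.005, count=None, sleep=time.sleep):
    """Commit a tiny write on the primary every `interval_s` seconds.

    Runs forever when `count` is None; otherwise stops after `count`
    beats. Returns the number of beats written.
    """
    cur = conn.cursor()
    cur.execute(HEARTBEAT_DDL)
    conn.commit()
    written = 0
    while count is None or written < count:
        cur.execute(HEARTBEAT_UPSERT)
        conn.commit()  # each commit gives the replica a fresh transaction to replay
        written += 1
        sleep(interval_s)
    cur.close()
    return written
```

With psycopg2 this could be called as `write_heartbeats(psycopg2.connect(dsn))`, where `dsn` is your own connection string for the primary. Leave it running and watch the lag on the replica: if the lag drops to near zero, the earlier numbers were mostly an artifact of an idle primary.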