I have a primary RDS instance with four replicas.
Primary Database: Postgres, 4 vCPU, 16GB RAM, us-west-2a
Replica1: Postgres, 4 vCPU, 16GB RAM, us-west-2a, 200G SSD (No traffic, just for testing)
Replica2: Postgres, 4 vCPU, 16GB RAM, us-west-2b, 200G SSD (No traffic, just for testing)
Replica3: Postgres, 2 vCPU, 8GB RAM, us-west-2b, 200G SSD (little traffic)
Replica4: Postgres, 2 vCPU, 8GB RAM, us-west-2b, 200G SSD (little traffic)
The lag between the primary and the read replicas exceeds 16 seconds, and sometimes reaches 30 seconds, even without any heavy IOPS.
I have spent a lot of effort digging for the root cause of the lag.
Here is the CloudWatch report for a replica without any traffic.
Assumption One: Is it caused by IO credits?
Here is the report for the IO credit balance. It has been at 100% for the past six hours, so I don't think the lag is caused by IO.
Even though I don't think IO is the cause, I decided to upgrade the database disk from gp2 to io1 with 3000 provisioned IOPS.
It didn't help; the lag is still there.
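As a sanity check on Assumption One, here is a minimal sketch of the gp2 burst arithmetic. It assumes the published gp2 figures (3 IOPS per GiB baseline with a 100 IOPS floor, bursting to 3000 IOPS from a 5.4 million credit bucket) and a 200 GiB volume like the replicas above. The function names are hypothetical and only for illustration.

```python
# Assumed gp2 characteristics; verify against the current AWS documentation.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full bucket


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS a gp2 volume of the given size sustains without credits."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_burst_seconds(size_gib: int, demand_iops: int = GP2_BURST_IOPS) -> float:
    """Seconds a full credit bucket lasts at a sustained demand (capped at burst)."""
    baseline = gp2_baseline_iops(size_gib)
    demand = min(demand_iops, GP2_BURST_IOPS)
    if demand <= baseline:
        # Demand is covered by the baseline, so credits are never drained.
        return float("inf")
    return GP2_CREDIT_BUCKET / (demand - baseline)


if __name__ == "__main__":
    print(gp2_baseline_iops(200))   # baseline IOPS for a 200 GiB volume
    print(gp2_burst_seconds(200))   # seconds of full-rate burst from a full bucket
```

Under these assumptions a 200 GiB volume sustains 600 IOPS indefinitely, so a credit balance that stays at 100% suggests the workload never exceeded that baseline.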
Assumption Two: Is it caused by the hot standby parameters?
There is no traffic on the replica, so the lag has nothing to do with the PostgreSQL parameter max_standby_streaming_delay or with hot standby.
Assumption Three: Is it caused by network IO?
The traffic is always less than 1M/s.
Assumption Four: Is it caused by long-running queries triggered by my application?
I created two brand-new m5.large PostgreSQL instances to verify this assumption, and used pgbench to benchmark them.
Primary: m5.large, with 3000 provisioned IOPS.
Replica: m5.xlarge, with 1000 provisioned IOPS.
I was surprised: the lag varies from 0 to 24 seconds.
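One way to cross-check a time-based lag figure is to measure the lag in bytes of WAL instead. The sketch below assumes PostgreSQL 10 or later, where `SELECT pg_current_wal_lsn()` on the primary and `SELECT pg_last_wal_replay_lsn()` on the replica return LSNs in the usual `high/low` hexadecimal text form. The helper names are hypothetical, and the LSN values in the example are made up.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def wal_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay to reach the primary."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


if __name__ == "__main__":
    # Illustrative values only, not taken from the benchmark above.
    print(wal_lag_bytes("0/3000060", "0/3000000"))
```

If the byte lag stays near zero while the reported lag in seconds is large, the two measurements disagree and the time-based number deserves a closer look. If both are large, the replica is genuinely behind on replay.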
You may ask why I don't post this problem to AWS.
I have asked this question on the AWS forum, but nobody has answered.
I feel cheated and would like to know, from your experience, what the real-world replication lag is.
Questions
Amazon Aurora provides an estimated value (under 100 ms) for the lag. Here is my benchmark report; the lag is under 25 ms.
When it comes to AWS RDS PostgreSQL:
Can anyone tell me what the normal replication lag for AWS RDS PostgreSQL is in the wild?
What replication lag does AWS promise or estimate for RDS PostgreSQL?