3
votes

I have a primary RDS instance with four replicas.

  • Primary Database: Postgres, 4 vCPU, 16GB RAM, us-west-2a

  • Replica1: Postgres, 4 vCPU, 16GB RAM, us-west-2a, 200G SSD (No traffic, just for testing)

  • Replica2: Postgres, 4 vCPU, 16GB RAM, us-west-2b, 200G SSD (No traffic, just for testing)

  • Replica3: Postgres, 2 vCPU, 8GB RAM, us-west-2b, 200G SSD (little traffic)

  • Replica4: Postgres, 2 vCPU, 8GB RAM, us-west-2b, 200G SSD (little traffic)

The lag between primary and read replica exceeds 16 seconds without any heavy IOPS, sometimes 30 seconds.

I have spent a lot of effort on digging the root cause of lag.

Here is the CloudWatch report for a replica without any traffic.

enter image description here

Assumption One: is it caused by IO credit?

Here is the report for IO credit, it's always 100% for the past six hours, I don't think it's caused by the IO issue.

enter image description here

Even I don't think it's caused by IO, I decide to upgrade the disk of the database from GP2 to IO1 with provisioned 3000 IOPS.

but it doesn't work, the lag is still there.

Assumption Two: is it caused by the parameter hot standby?

There is no traffic in the replia! it has nothing to do with postgresql parameter max_standby_streaming_delay and hot standby

Assumption Three: is it caused by Network IO?

the traffic is always less than 1M/s

Assumption Four: Is it caused by long-running queries that triggered in my application?

I create two brand new m5.large PostgreSQL instance to verify this assumption, and use pgbench to benchmark.

  • Primary: M5.large, with 3000 provisioned IOPS.

  • Replica: M5.xlarge, with 1000 provisioned IOPS.

I'm surprised! the lag varies from 0 to 24 seconds.

enter image description here

You may ask why don't you post this problem to aws?

I have asked this question in aws forum, but nobody answers me.

I feel cheated and would like to know the real value of replication lag from your experience.

Questions

AWS Amazon Aurora provides an estimated value (under 100ms) for the lag. Here is my benchmark report, the lag is under 25ms.

enter image description here

when it comes to AWS RDS PostgreSQL:

  • Can anyone tell me what's the normal value of aws RDS PostgreSQL replication lag in the wild?

  • What's the promised estimated value of the replication lag for AWS RDS PostgreSQL?

1

1 Answers

2
votes

According to Read replica limitations with PostgreSQL in RDS documentation:

If no user transactions are occurring on the source DB instance, a PostgreSQL read replica reports a replication lag of up to five minutes.

Can you check the replication lag, when having a script that writes to the database every a few milliseconds, as recommended in this answer?