We have 13 Amazon RDS databases monitored by Debezium, which is deployed in a Kafka Connect cluster. Right now, 1 of these 13 databases has a replication slot whose lag keeps increasing.

12 databases have 10 to 120 kB of lag, while one has more than 700 MB at the moment.

I measured the lag with this query:

SELECT slot_name, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmedLag,
       active
FROM pg_replication_slots;

[Screenshot: replication slot lag]
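
For readers who want to check the numbers outside of psql, here is a minimal sketch of what pg_wal_lsn_diff computes, assuming LSNs in their usual text form (two hexadecimal halves separated by a slash). The function names and the 100 MB threshold are hypothetical, chosen only to separate the "kB" slots from the "hundreds of MB" slot described above.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff_bytes(current: str, older: str) -> int:
    """Byte distance between two LSNs, like pg_wal_lsn_diff(current, older)."""
    return parse_lsn(current) - parse_lsn(older)


def is_lagging(current: str, confirmed: str, threshold_bytes: int = 100 * 1024 * 1024) -> bool:
    """Flag a slot whose confirmed_flush_lsn trails by more than the (hypothetical) threshold."""
    return lsn_diff_bytes(current, confirmed) > threshold_bytes
```

Feeding it pg_current_wal_lsn() and confirmed_flush_lsn for each slot should give the same byte counts that the query pretty-prints.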

The replication slot is active. I checked the connector status (GET kafka-connect:8083/connectors/<connector-name>/status), and both the connector and its tasks are in the RUNNING state.
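
The status check can be scripted across all 13 connectors. This is a rough sketch, assuming the standard Kafka Connect REST path /connectors/<name>/status and its usual response shape (a "connector" object and a "tasks" list, each with a "state" field). The helper names are hypothetical. Note that, as this question shows, a RUNNING state by itself does not prove the slot is being advanced.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str, connector: str) -> dict:
    """Fetch the status document for one connector from the Kafka Connect REST API."""
    with urlopen(f"{base_url}/connectors/{connector}/status") as resp:
        return json.load(resp)


def not_running(status: dict) -> list:
    """Return the parts of a connector (connector itself or tasks) that are not RUNNING."""
    problems = []
    if status.get("connector", {}).get("state") != "RUNNING":
        problems.append("connector")
    for task in status.get("tasks", []):
        if task.get("state") != "RUNNING":
            problems.append(f"task-{task.get('id')}")
    return problems
```

For example, `not_running(fetch_status("http://kafka-connect:8083", "<connector-name>"))` returns an empty list when everything reports RUNNING.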

To add more information: we have enabled the heartbeat along with heartbeat.action.query, which periodically inserts a dummy event into the outbox table, so I expect to receive a new change every 10 seconds for each monitored database.
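
For context, the heartbeat part of such a connector configuration might look roughly like the sketch below. heartbeat.interval.ms and heartbeat.action.query are the Debezium property names; the outbox table, its columns, and the helper function are hypothetical placeholders, not our actual setup.

```python
import json


def heartbeat_settings(interval_seconds: int, action_query: str) -> dict:
    """Build the heartbeat-related fragment of a Debezium connector config."""
    return {
        # Debezium expects the interval in milliseconds.
        "heartbeat.interval.ms": str(interval_seconds * 1000),
        # Statement Debezium runs against the source database on each heartbeat.
        "heartbeat.action.query": action_query,
    }


# Hypothetical outbox table and columns; replace with the real schema.
QUERY = "INSERT INTO outbox (aggregate_type, payload) VALUES ('heartbeat', '{}')"
settings = heartbeat_settings(10, QUERY)
print(json.dumps(settings, indent=2))
```

The resulting fragment would be merged into the full connector configuration submitted to Kafka Connect.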

We already tried to:

  • Check the logs for heartbeat thread failures, but we didn't see any exceptions
  • Restart the cluster, but the lag is still there
  • Check the heartbeat topic for the lagging database: there are no messages in it, even after the cluster restart

Does anyone have an idea of what's happening?

1 Answer

It looks like executing the heartbeat.action.query statement manually on the lagging database does the trick. I still don't know why or when this happens.