We have 13 databases (Amazon RDS) monitored by Debezium connectors deployed in a Kafka Connect cluster. Right now, 1 of these 13 databases has a replication slot whose lag keeps increasing.
12 databases have 10 to 120 kB of lag, while one has more than 700 MB at the moment.
The lag is measured with this query:
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmedLag,
active
FROM pg_replication_slots;
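
For reference, the lag columns in the query above are byte differences between two WAL positions. A minimal Python sketch of that arithmetic, assuming the usual `X/Y` hexadecimal LSN text format; the function names and the sample LSNs are made up for illustration and are not values from our databases:

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN string such as '16/B374D848' into an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after it the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(current: str, other: str) -> int:
    """Byte distance between two LSNs, mirroring what pg_wal_lsn_diff(current, other) returns."""
    return parse_lsn(current) - parse_lsn(other)


# Hypothetical example: the current WAL position versus a slot's restart_lsn.
lag_bytes = lsn_diff("16/B374D848", "16/B374D800")
```

A slot whose `restart_lsn` or `confirmed_flush_lsn` does not advance while the database keeps writing WAL will show this difference growing over time.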
The replication slot is active. I checked the connector status (GET kafka-connect:8083/connectors/<connector-name>/status) and both the connector and its tasks are in the RUNNING state.
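
In case it helps to reproduce the check, here is a rough sketch of how the status response can be inspected from a script. It assumes the standard Kafka Connect status payload (a `connector` object and a `tasks` list, each with a `state` field); the helper names and the sample payload are hypothetical:

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str, connector_name: str) -> dict:
    """Fetch the status document for one connector from the Kafka Connect REST API."""
    with urlopen(f"{base_url}/connectors/{connector_name}/status") as resp:
        return json.load(resp)


def unhealthy_parts(status: dict) -> list:
    """Return a description of every connector or task that is not RUNNING."""
    problems = []
    connector_state = status["connector"]["state"]
    if connector_state != "RUNNING":
        problems.append(f"connector: {connector_state}")
    for task in status.get("tasks", []):
        if task["state"] != "RUNNING":
            problems.append(f"task {task['id']}: {task['state']}")
    return problems


# Hypothetical payload shaped like a Kafka Connect status response.
sample = {
    "name": "example-connector",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"}],
}
problems = unhealthy_parts(sample)  # empty list when everything is RUNNING
```

In our case this kind of check comes back clean for all 13 connectors, including the lagging one.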
For additional context, we have enabled heartbeats and heartbeat.action.query to periodically insert a dummy event into the outbox table, so I expect to receive a new change every 10 seconds for each monitored database.
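
To make the setup concrete, the heartbeat part of the connector configuration looks roughly like the sketch below. The two property names are Debezium's `heartbeat.interval.ms` and `heartbeat.action.query`; the helper function, the table columns, and the SQL statement are placeholders for illustration, not our actual values:

```python
def heartbeat_settings(interval_seconds: int, action_query: str) -> dict:
    """Build the heartbeat-related entries of a Debezium connector configuration."""
    return {
        # Connector configuration values are strings; the interval is in milliseconds.
        "heartbeat.interval.ms": str(interval_seconds * 1000),
        # Statement the connector runs against the source database on each heartbeat.
        "heartbeat.action.query": action_query,
    }


# Hypothetical query: the real outbox table has different columns.
settings = heartbeat_settings(
    10,
    "INSERT INTO outbox (payload) VALUES ('heartbeat')",
)
```

With a 10 second interval, every monitored database should produce at least one change event per interval, which is why the growing lag on a single slot is surprising.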
We have already tried to:
- Check the logs for heartbeat thread failures, but we didn't see any exceptions
- Restart the cluster, but the lag is still there
- Check the heartbeat topic for the lagging database: there are no messages in it, even after the cluster restart
Does anyone have an idea of what's happening?
