Debezium heartbeat is not committing LSN

Question

We monitor 13 Amazon RDS databases with Debezium, deployed in a Kafka Connect cluster. One of these 13 databases has a replication slot whose lag keeps increasing.

Twelve databases have between 10 and 120 kB of lag, while one currently has more than 700 MB.

The lag and slot state come from this query:

SELECT slot_name, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag, 
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmedLag,
       active
FROM pg_replication_slots;
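
For readers unfamiliar with pg_wal_lsn_diff: it returns the distance in bytes between two LSNs. A minimal Python sketch of the same arithmetic, assuming the usual textual LSN format of two hexadecimal numbers separated by a slash (high 32 bits, then low 32 bits). The LSN values below are hypothetical and chosen only to land near the 700 MB figure mentioned above.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_lsn_diff(current_lsn: str, slot_lsn: str) -> int:
    """Bytes of WAL between two LSNs, mirroring pg_wal_lsn_diff(a, b) = a - b."""
    return lsn_to_int(current_lsn) - lsn_to_int(slot_lsn)


# Hypothetical LSNs: current WAL position vs. the slot's confirmed_flush_lsn.
lag_bytes = wal_lsn_diff("1/2C000000", "1/00000000")
print(f"{lag_bytes / 1024 / 1024:.0f} MB")  # prints "704 MB"
```

A slot whose confirmed_flush_lsn never moves will show this number growing for as long as anything on the instance keeps writing WAL.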

The replication slot is active. I also checked the connector status (GET kafka-connect:8083/connectors/<connector-name>/status), and both the connector and its tasks are in the RUNNING state.
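
The status check can be scripted across all connectors. This is a rough sketch, assuming the standard Kafka Connect REST status payload (a "connector" object and a "tasks" list, each with a "state" field). The host, connector name, and sample payload are hypothetical. Note that a RUNNING state only says the task is alive; as this question shows, it does not prove that heartbeat events are flowing.

```python
import json
from typing import List
from urllib.request import urlopen


def fetch_status(base_url: str, connector: str) -> dict:
    """GET /connectors/<name>/status from the Kafka Connect REST API."""
    with urlopen(f"{base_url}/connectors/{connector}/status") as resp:
        return json.load(resp)


def not_running(status: dict) -> List[str]:
    """Return labels for the connector and tasks whose state is not RUNNING."""
    problems = []
    if status["connector"]["state"] != "RUNNING":
        problems.append("connector")
    for task in status.get("tasks", []):
        if task["state"] != "RUNNING":
            problems.append(f"task-{task['id']}")
    return problems


# Sample payload in the shape the status endpoint returns (names are made up).
sample = {
    "name": "db-07-connector",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"}],
}
print(not_running(sample))  # prints []
```

Against a live cluster, something like not_running(fetch_status("http://kafka-connect:8083", "<connector-name>")) would do the same check.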

For more context: we have enabled heartbeats (heartbeat.interval.ms) together with heartbeat.action.query, which periodically inserts a dummy event into the outbox table, so I expect to receive a new change every 10 seconds for each monitored database.
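
With 13 connectors it is easy for one config to drift, so a quick sanity check of the heartbeat settings may be worth running. A small sketch follows, assuming the Debezium property names heartbeat.interval.ms (which defaults to 0, meaning heartbeats are disabled) and heartbeat.action.query. The outbox columns in the query are hypothetical, not the poster's actual schema.

```python
from typing import List


def heartbeat_problems(config: dict) -> List[str]:
    """List the reasons why a connector config would not emit heartbeat changes."""
    problems = []
    interval = int(config.get("heartbeat.interval.ms", 0))
    if interval <= 0:
        problems.append("heartbeat.interval.ms is unset or 0, so heartbeats are disabled")
    if not config.get("heartbeat.action.query", "").strip():
        problems.append("heartbeat.action.query is empty")
    return problems


# Hypothetical connector config: a 10-second heartbeat that inserts a dummy outbox row.
connector_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "heartbeat.interval.ms": "10000",
    "heartbeat.action.query": (
        "INSERT INTO outbox (event_type, created_at) VALUES ('heartbeat', now())"
    ),
}
print(heartbeat_problems(connector_config))  # prints []
```

The config actually deployed for each connector can be read from GET /connectors/<connector-name>/config and passed through the same function.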

We already tried to:

Check the logs for a heartbeat thread failure, but we didn't see any exception.
Restart the cluster, but the lag is still there.
Check the heartbeat topic for the lagging database, but it contains no messages, even after the cluster restart.

Does anyone have an idea of what's happening?

Usernameless · Accepted Answer · 2021-04-30T12:43:47

It looks like executing the heartbeat.action.query statement manually on the lagging database does the trick. I still don't know why or when this happens.
