First of all I know it's odd to rely on a manual vacuum from the application layer, but this is how we decided to run it. I have the following stack :
- HikariCP
- JDBC
- Postgres 11 in AWS
Now here is the problem. When we start fresh with brand new tables with autovacuum=off the manual vacuum is working fine. I can see the number of dead_tuples growing up to the threshold then going back to 0. The tables are being updated heavily in parallel connections (HOT is being used as well). At some point the number of dead rows is like 100k jumping up to the threshold and going back to 100k. The n_dead_tuples slowly creeps up.
Now the worst of all is that when you issue vacuum from a pg console ALL the dead tuples are cleaned, but oddly enough when the application is issuing vacuum it's successful, but partially cleans "threshold amount of records", but not all ? Now I am pretty sure about the following:
- Analyze is not running, nor auto-vacuum
- There are no long running transactions
- No replication is going on
- These tables are "private"
Where is the difference between issuing a vacuum from the console with auto-commit on vs JDBC ? Why the vacuum issued from the console is cleaning ALL the tupples whereas the vacuum from the JDBC cleans it only partially ? The JDBC vacuum is ran in a fresh connection from the pool with the default isolation level, yes there are updates going on in parallel, but this is the same as when a vacuum is executed from the console.
Is the connection from the pool somehow corrupted and can not see the updates? Is the ISOLATION the problem? Visibility Map corruption? Index referencing old tuples?
Side-note: I have observed that same behavior with autovacuum on and cost limit through the roof like 4000-8000 , threshold default + 5% . At first the n_dead_tuples is close to 0 for like 4-5 hours... The next day the table is 86gigs with milions of dead tuples. All the other tables are vacuumed and ok...
PS: I will try to log a vac verbose in the JDBC.. PS2: Because we are running in AWS could it be a backup that is causing it to stop cleaning ?
PS3: When refering to vaccum I mean simple vacuum, not full vacuum. We are not issuing full vacuum.
VACUUM (VERBOSE)
would be informative. Setautovacuum_vacuum_cost_delay = 0
for maximal speed. – Laurenz Albe