1
votes

First of all, I know it's odd to rely on a manual vacuum from the application layer, but this is how we decided to run it. I have the following stack:

  1. HikariCP
  2. JDBC
  3. Postgres 11 in AWS

Now here is the problem. When we start fresh with brand-new tables and autovacuum=off, the manual vacuum works fine: I can see the number of dead tuples growing up to the threshold and then dropping back to 0. The tables are being updated heavily from parallel connections (HOT updates are happening as well). At some point, though, the number of dead rows sits around 100k, jumps up to the threshold, and only drops back to 100k. The n_dead_tup count slowly creeps up.
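
For reference, this is roughly how I watch the counters (a sketch, not the exact code: my_table is a placeholder, dataSource is the HikariCP pool, and the snippets assume the usual java.sql.* imports):

    // Dead-tuple count for one table, straight from the statistics collector.
    String sql = "SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = ?";
    try (Connection con = dataSource.getConnection();
         PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setString(1, "my_table");
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                System.out.println("n_dead_tup = " + rs.getLong(1));
            }
        }
    }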

Now the worst part: when I issue VACUUM from a pg console, ALL the dead tuples are cleaned up, but when the application issues VACUUM it completes successfully yet only cleans up a "threshold amount" of records, not all of them. Now I am pretty sure about the following:

  • Analyze is not running, nor auto-vacuum
  • There are no long running transactions
  • No replication is going on
  • These tables are "private"

What is the difference between issuing VACUUM from the console with auto-commit on vs. through JDBC? Why does the VACUUM issued from the console clean ALL the tuples, whereas the VACUUM from JDBC cleans only part of them? The JDBC VACUUM is run on a fresh connection from the pool with the default isolation level. Yes, there are updates going on in parallel, but that is also the case when a VACUUM is executed from the console.

Is the connection from the pool somehow corrupted and unable to see the updates? Is isolation the problem? Visibility map corruption? Indexes referencing old tuples?
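
For context, the application-side vacuum boils down to something like this (a simplified sketch with the same assumptions as above; VACUUM cannot run inside a transaction block, so auto-commit stays on):

    // Plain VACUUM issued on a fresh pooled connection, outside any transaction.
    try (Connection con = dataSource.getConnection();
         Statement st = con.createStatement()) {
        con.setAutoCommit(true);            // no surrounding transaction block
        st.execute("VACUUM my_table");      // plain VACUUM, not FULL
    }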

Side note: I have observed the same behavior with autovacuum on and the cost limit through the roof, like 4000-8000, with the threshold at default + 5%. At first n_dead_tup stays close to 0 for 4-5 hours... The next day the table is 86 GB with millions of dead tuples. All the other tables are vacuumed and OK...
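
For reference, the per-table settings in that experiment were along these lines (a sketch; my_table is a placeholder, and 4000 / 0.05 stand in for the "cost limit through the roof" and "default + 5%" values):

    // Per-table autovacuum tuning used in the side-note experiment.
    try (Connection con = dataSource.getConnection();
         Statement st = con.createStatement()) {
        st.execute("ALTER TABLE my_table SET ("
                 + "autovacuum_vacuum_cost_limit = 4000, "
                 + "autovacuum_vacuum_scale_factor = 0.05)");
    }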

PS: I will try to log a VACUUM VERBOSE from JDBC.

PS2: Since we are running in AWS, could it be a backup that is causing it to stop cleaning?

PS3: When referring to vacuum I mean plain VACUUM, not VACUUM FULL. We are not issuing VACUUM FULL.
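
Regarding PS: one way to capture the VACUUM VERBOSE output through JDBC is via SQLWarning, since the Postgres driver surfaces the server's INFO messages as warnings on the statement (again a sketch, same assumptions as above):

    // Run VACUUM (VERBOSE) and print whatever the server reports.
    try (Connection con = dataSource.getConnection();
         Statement st = con.createStatement()) {
        con.setAutoCommit(true);
        st.execute("VACUUM (VERBOSE) my_table");   // placeholder table name, not user input
        for (SQLWarning w = st.getWarnings(); w != null; w = w.getNextWarning()) {
            System.out.println(w.getMessage());    // or hand it to the application logger
        }
    }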

1
That sounds pretty unlikely. Yes, VACUUM (VERBOSE) would be informative. Set autovacuum_vacuum_cost_delay = 0 for maximal speed. – Laurenz Albe
I have not done vacuum before, so I find this quite interesting. Could you show the code for how you are actually triggering the FULL vacuum? Also, looking at postgresql.org/docs/current/…, there seem to be a lot of options that could be tried. stackoverflow.com/questions/46982548/… – Susan Mustafa
I am not using full vacuum. – Alex
"Now the worst of all is that when you issue vacuum from a pg console ALL the dead tuples are cleaned, but oddly enough when the application is issuing vacuum it's successful, but partially cleans "threshold amount of records", but not all ?" What specifically are you seeing that causes you to think this?jjanes
@jjanes While everything is up and running I can see the app vacuuming, but n_dead_tup goes from 50k-100k and then back to 50k... Vacuum from the console and n_dead_tup drops to 0... After a while it starts to creep: 10-60 and back, 20-70 and back to 20k, and so on and so forth. – Alex

1 Answer

1
votes

The main problem was that the VACUUM was being run as a different user, one that does not own the tables, so VACUUM was skipping them with just a warning. The "vacuuming" I was seeing was actually the HOT updates plus the selects running over that data, which clean up dead tuples within a page on the fly (HOT pruning).
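
A quick way to confirm this is to compare the role the pool connects with against the table owners; in Postgres 11, VACUUM simply skips tables you do not own and only emits a WARNING. A sketch, same assumptions as the snippets in the question:

    // List table owners next to the role the pool connects as.
    try (Connection con = dataSource.getConnection();
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT schemaname, tablename, tableowner, current_user AS vacuum_user "
           + "FROM pg_tables "
           + "WHERE schemaname NOT IN ('pg_catalog', 'information_schema')")) {
        while (rs.next()) {
            System.out.printf("%s.%s owner=%s vacuum_user=%s%n",
                    rs.getString(1), rs.getString(2), rs.getString(3), rs.getString(4));
        }
    }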

Next: vacuuming is affected by long-running transactions ACROSS ALL schemas and tables. Yes, ALL schemas and tables. Changing to the correct user fixed the VACUUM itself, but its cleanup still gets held back whenever there is a session with an open transaction (e.g. idle in transaction), even one working on a completely different schema.table.
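
What we check for now (same setup as the snippets above) is any session with an old open transaction, because its snapshot pins the cleanup horizon that VACUUM has to respect, no matter which schema or table the session is working on. A sketch of that check:

    // Long-running or idle-in-transaction sessions that hold back cleanup.
    try (Connection con = dataSource.getConnection();
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT pid, usename, state, now() - xact_start AS xact_age "
           + "FROM pg_stat_activity "
           + "WHERE xact_start IS NOT NULL "
           + "ORDER BY xact_start")) {
        while (rs.next()) {
            System.out.printf("pid=%d user=%s state=%s age=%s%n",
                    rs.getInt(1), rs.getString(2), rs.getString(3), rs.getString(4));
        }
    }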

Raising maintenance_work_mem helps, but in the end, when the system is under heavy load, all vacuuming is effectively paused.
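
It can be raised just for the session that runs the vacuum instead of globally; something like this (the value is only an example):

    // Bump maintenance_work_mem only for the vacuuming session.
    try (Connection con = dataSource.getConnection();
         Statement st = con.createStatement()) {
        con.setAutoCommit(true);
        st.execute("SET maintenance_work_mem = '512MB'");   // example value
        st.execute("VACUUM my_table");                      // placeholder table
    }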

So we upgraded the DB's resources a bit and added a monitor to alert us if there are any issues.
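
A minimal sketch of the kind of check the monitor runs (the exact thresholds and alerting are setup-specific):

    // Dead tuples per table plus the last (auto)vacuum timestamps.
    try (Connection con = dataSource.getConnection();
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT relname, n_dead_tup, last_vacuum, last_autovacuum "
           + "FROM pg_stat_user_tables "
           + "ORDER BY n_dead_tup DESC")) {
        while (rs.next()) {
            System.out.printf("%s dead=%d last_vacuum=%s last_autovacuum=%s%n",
                    rs.getString(1), rs.getLong(2),
                    rs.getTimestamp(3), rs.getTimestamp(4));
        }
    }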