Amazon Redshift at 100% disk usage due to VACUUM query

11

votes

After reading the Amazon Redshift documentation, I ran a VACUUM on a 400GB table that had never been vacuumed before, in an attempt to improve query performance. Unfortunately, the VACUUM caused the table to grow to 1.7TB (!!) and brought the cluster's disk usage to 100%. I then tried to stop the VACUUM by running a CANCEL query in the superuser queue (you enter it by running "set query_group='superuser';"). The query didn't raise an error, but it had no effect on the vacuum query, which keeps running.

What can I do?

amazon-web-services amazon-redshift vacuum

9

votes

I have stopped a vacuum operation several times. Maybe the feature was not available at that time.
Run the query below, which gives you the process id of the vacuum query.

select * from stv_recents where status='Running';

Once you have the process id, you can run the following query to terminate the process.

select pg_terminate_backend( pid );
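
To make the two steps above a bit more concrete, here is a minimal sketch. It assumes the vacuum's query text in stv_recents starts with "vacuum", and the pid 12345 is a made-up placeholder for whatever the first query returns on your cluster.

```sql
-- Step 1: list only the running vacuum statements, with their process ids.
-- Assumption: the query text starts with the word "vacuum".
select pid, user_name, starttime, duration, trim(query) as query_text
from stv_recents
where status = 'Running'
  and trim(query) ilike 'vacuum%';

-- Step 2: terminate the session that owns the vacuum.
-- 12345 is a hypothetical pid; substitute the pid returned by step 1.
select pg_terminate_backend(12345);
```

Terminating a backend requires being a superuser or the owner of that session, so this may need to be run from a superuser connection.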

8

votes

Apparently, there is currently not much you can do. I was on the phone with Amazon support for an hour, and they didn't have the tools to stop the vacuum operation. They opened a ticket about CANCEL silently not working on VACUUM queries.

They suggested that I take a snapshot of the cluster (this should normally take a few minutes if you have made previous snapshots) and then restart the cluster. It sort of worked: the vacuum stopped and some of the disk space was freed (600GB), but the table remained more than twice its original size. Because vacuuming it again would be too risky, I resorted to creating a deep copy of it, which should create a vacuumed copy of the table. (You can read about deep copy here - http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html).
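
For reference, a deep copy along the lines of the linked documentation could look roughly like the sketch below. The table names big_table and big_table_copy are hypothetical, and this is only one of the documented variants (the CREATE TABLE LIKE one), not necessarily the exact steps used here.

```sql
-- Hypothetical table names; replace big_table with the bloated table.
-- Create an empty table with the same columns, distribution key and sort key.
create table big_table_copy (like big_table);

-- Bulk-insert the rows; the new table is written sorted and without deleted rows.
insert into big_table_copy (select * from big_table);

-- Swap the copy into place.
drop table big_table;
alter table big_table_copy rename to big_table;
```

Keep in mind that the copy needs enough free disk space to hold a second version of the table until the original is dropped, and that CREATE TABLE LIKE does not carry over primary and foreign key constraints.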

4

votes

Hint: Run this query (taken from here) to see which tables you should vacuum.

Note: This will help only in the case where you want to know which tables are big, and what you can gain by vacuuming each one.

select trim(pgdb.datname) as Database,
    trim(a.name) as Table,
    ((b.mbytes/part.total::decimal)*100)::decimal(5,2) as pct_of_total,
    b.mbytes,
    b.unsorted_mbytes
    from stv_tbl_perm a
    join pg_database as pgdb on pgdb.oid = a.db_id
    join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes
    from stv_blocklist group by tbl) b on a.id=b.tbl
    join ( select sum(capacity) as  total
      from stv_partitions where part_begin=0 ) as part on 1=1
    where a.slice=0
    order by 3 desc, db_id, name;

Then vacuum table(s) with high unsorted_mbytes: VACUUM your_table;
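
If you do start a vacuum this way, it may be worth watching it from a second session. A small sketch, assuming the same placeholder name your_table as above:

```sql
-- Session 1: vacuum the table picked from the query above (placeholder name).
vacuum your_table;

-- Session 2: check on the vacuum that is currently running.
select table_name, status, time_remaining_estimate
from svv_vacuum_progress;
```

svv_vacuum_progress only gives an estimate of the remaining time, so treat it as a rough indicator.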

1

votes

Vacuum should be scheduled regularly. If you vacuum the table on a daily basis, it should be very quick and won't have significant side effects.
In the case you described, it would be safer to scale the cluster up to a larger configuration, run the vacuum, and then scale back down to the original configuration. Remember that free disk space is crucial for calculations on a Redshift cluster: when free disk space runs low, all read/write operations on the cluster become very slow.
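
A quick way to see how much headroom the cluster has before starting a big vacuum is to sum up the partitions. This is a rough sketch using the same stv_partitions filter as the table-size query in another answer; the values are reported in 1 MB blocks, so dividing by 1024 gives approximate gigabytes.

```sql
-- Approximate cluster-wide capacity, used and free space in GB.
select sum(capacity) / 1024 as capacity_gbytes,
       sum(used) / 1024 as used_gbytes,
       (sum(capacity) - sum(used)) / 1024 as free_gbytes
from stv_partitions
where part_begin = 0;
```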
