3
votes

Due to a query that had been running for more than 100 hours, we saw in Aginity that our cluster size went from 1 TB to 5 TB.

By inspecting svv_table_info, we saw that each table was far bigger than what we used to see in the past. We then checked the AWS console and saw that the size increase had started 5 days earlier, at the same time the 100-hour query started.
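The check was roughly the following query against svv_table_info (note that the size column counts 1 MB blocks per table, so it is an approximation of the real footprint):

```sql
-- Approximate size (in MB of 1 MB blocks) of each table, largest first.
SELECT "table",
       size     AS size_mb,
       pct_used,
       tbl_rows
FROM   svv_table_info
ORDER  BY size DESC;
```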

After killing the query, a few minutes later the Redshift size went back to 1 TB and each table's size went back to normal.

Why is this happening?

Just for the record, the 100-hour query did not involve all of the tables whose size increased dramatically while the query was running.


EDITED

I cannot now really reproduce the error. But the steps were the following:

  • In Aginity, I accidentally noticed that the size of the cluster was 5 TB, even though the cluster has only 2 × ds2.xlarge nodes (4 TB total)

  • I queried svv_table_info to get the size of each of the tables - they summed up to 5 TB, and most of them looked implausibly big

  • I saw that the DWH had all up-to-date data, even though it was "reportedly" full for at least 2 days (it also went over 4 TB in size)

  • I found a query that had been running for 100+ hours - one of our data analysts had left a notebook open. The query did not involve all of the tables that seemed unreasonably big

  • I killed the query, and after a few moments everything went back to normal
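The kill itself was nothing special - something along these lines, using the stv_recents system view to find the currently running statement (the pid below is a placeholder):

```sql
-- Find the longest-running statements; duration is in microseconds.
SELECT pid, duration, user_name, query
FROM   stv_recents
WHERE  status = 'Running'
ORDER  BY duration DESC;

-- Terminate the offending session by its pid (12345 is a placeholder).
SELECT pg_terminate_backend(12345);
```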

So: how could Redshift grow to 5 TB if we only have 2 × 2 TB = 4 TB of space available?

2
When the same thing happened to me, I had a bug in my query that produced a Cartesian product of a big table with itself, so n² rows... and this spills to disk. Check your join conditions carefully - AlexYes
You're assuming that the table size in svv_table_info reflects the actual size on disk, and that's not always true. It may appear that the total table size is > 4 TB, but this is due to the way svv_table_info calculates table size, which is approximate. You only have 4 TB of disk regardless of what svv_table_info is telling you. The cluster size did not "go from 1 TB to 5 TB" - the cluster size is always 4 TB (in this case). What you are looking at is the percentage of used disk space: initially it was about 25%, and with the big query it went up to 100%. - Nathan Griffiths

2 Answers

0
votes

It happened to us as well. Redshift uses disk space while running a query, which is why the cluster size goes back to normal when you kill the query.

Here's a really good article about that: https://www.periscopedata.com/blog/disk-based-temporary-tables

0
votes

It may be helpful to distinguish first how Amazon Redshift uses storage during query execution. There are two ways:

  1. Disk-based Queries. When a query runs out of memory, the overflow “spills” to the disk and the query goes “disk-based”.
  2. Intermediate Storage. When a query needs to save the results of an intermediate operation, to use as input for a future operation.
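To tell the two apart in practice, you can check whether any step of a suspect query spilled to disk. A sketch against the svl_query_summary system view (the query id is a placeholder):

```sql
-- Steps of a given query that went disk-based; any row with
-- is_diskbased = 't' means that step overflowed its memory allocation.
SELECT query, step, rows, workmem, label, is_diskbased
FROM   svl_query_summary
WHERE  query = 12345          -- placeholder query id
  AND  is_diskbased = 't';
```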

In this case, I think you're looking at use of intermediate storage. Whatever the query calculated, it started filling up the disk with intermediate results. This happens a lot when a query joins two very large tables (e.g. each with billions of rows), typically written by somebody inexperienced in writing OLAP queries. The absolute number of 5 TB is less relevant than the percentage of disk space used, which was 100% in your case.
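You can watch that percentage directly rather than relying on svv_table_info estimates. A sketch against the stv_partitions system view, where capacity and used are counted in 1 MB blocks:

```sql
-- Real disk capacity and usage per node; this reflects the actual
-- 4 TB of disk, regardless of what svv_table_info reports.
SELECT owner AS node,
       SUM(capacity) / 1024 AS capacity_gb,
       SUM(used)     / 1024 AS used_gb,
       ROUND(SUM(used)::numeric * 100 / SUM(capacity), 1) AS pct_used
FROM   stv_partitions
GROUP  BY owner
ORDER  BY owner;
```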

We've written a post about how to fix disk-based queries that goes into the details of Redshift here: https://www.intermix.io/blog/how-to-fix-disk-based-queries-amazon-redshift/