
I'm creating a table on Amazon Redshift to store a large amount of data per day.

I've tried to optimize the database performance as much as I could using sort keys.

The idea is to serve a web application with this data through an API performing SELECT queries on it.

After trying multiple different sort keys, I'm not convinced at all that I'm using the right one. I must be missing or misunderstanding something ...

Table definition:

CREATE TABLE test_table(
  date date NOT NULL,
  country char(2) NOT NULL,
  application_id integer NOT NULL,
  device smallint NOT NULL,
  category smallint NOT NULL,
  subcategory smallint DEFAULT NULL,
  rank smallint DEFAULT NULL,
  subrank smallint DEFAULT NULL,
  is_free smallint NOT NULL,
  downloads integer DEFAULT NULL)
  distkey(application_id);

Data context:

  1. Store between 10 000 000 and 20 000 000 rows per day
  2. Keep 2 years of history

Sort keys I've already tried (a full DDL sketch follows the list):

  1. Compound sortkey(device, is_free, date, country, category, subcategory)
  2. Interleaved sortkey(device, is_free, country, category, subcategory)
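
For reference, here is how variant 1 attaches to the full table definition above (a sketch; variant 2 only swaps the final clause):

CREATE TABLE test_table(
  date date NOT NULL,
  country char(2) NOT NULL,
  application_id integer NOT NULL,
  device smallint NOT NULL,
  category smallint NOT NULL,
  subcategory smallint DEFAULT NULL,
  rank smallint DEFAULT NULL,
  subrank smallint DEFAULT NULL,
  is_free smallint NOT NULL,
  downloads integer DEFAULT NULL)
  distkey(application_id)
  -- variant 2 replaces the next line with:
  -- interleaved sortkey(device, is_free, country, category, subcategory)
  compound sortkey(device, is_free, date, country, category, subcategory);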

Performance test executed (on 18 million generated rows):

With any of these sort keys, the example queries below always take between 3 and 7 seconds, depending on the number of countries / categories given and the date range.

Query examples:

SELECT country, category, sum(downloads)
FROM test_table
WHERE country IN ('fr','jp', 'de', 'gb', 'us')
AND category in (6014, 6005, 6011, 6004, 6017)
AND device = 0
AND is_free = 1
AND date >= '2019-01-01' AND date <= '2019-04-01'
GROUP BY country, category;

SELECT category, country, rank, avg(downloads)
FROM test_table
WHERE country IN ('br','co', 'ru')
AND category in (6009, 6008, 6000)
AND device = 1
AND is_free = 0
AND rank IN (1, 10, 100)
AND date BETWEEN '2019-03-01' AND '2019-04-01'
GROUP BY category, country, rank;

SELECT category, country, application_id, sum(downloads)
FROM test_table
WHERE country IN ('us', 'cn', 'jp')
AND category in (6010, 6003, 6002)
AND device = 1
AND is_free = 1
AND date BETWEEN '2019-04-01' AND '2019-06-01'
GROUP BY category, country, application_id;

Is it possible to make this faster? Are the selected sort keys bad? Can I put the date field in the interleaved sort key? (Even though I've read this is a bad idea.)

If you think Redshift is not suited to this use case, do you have other database suggestions? (I have no restriction on the technology.)

Thank you in advance for your help :)

Comments:

If you are looking for faster performance than a few seconds, and/or want to serve many users through a web app, then Redshift is probably not the correct architectural choice. What max response time (in seconds) are you looking for, and how many users are you expecting? You also need to consider how the data is loaded (you will want your sort key to match the order in which records are loaded, to reduce the need for re-sorting). – Jon Scott

@JonScott Thank you for your quick reply. A few seconds are OK; I was just wondering whether bad sort keys were the cause and whether it was possible to improve. The limit is 30 seconds, but not for 18 million rows (which is around 1-2 days of data), rather for 1-2 years of data. – Gabriel

I fear that performance will decrease a lot over the year if I add 10-20 million rows per day. – Gabriel

The data will be inserted per group of country/date/device: for example, first 50 000 rows for be - 01/08/19 - iphone, then 40 000 rows for cn - 07/09/19 - ipad, and so on. At the moment there is no order, but I can load all dates for be, then all dates for cn, if that is better. – Gabriel

How many users (how many queries per 24 hours)? – Jon Scott

1 Answer


Redshift is absolutely the right choice for this type of query, IMO. See my examples below, where I get response times of just a few hundred milliseconds on a tiny cluster.

The date or timestamp column should generally be the first column in your compound sort key. Add other columns in descending order of the number of unique values.
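
Applied to the table in the question, that rule of thumb might look like the sketch below. I'm assuming country has more distinct values than category and subcategory, and that device and is_free have only a handful each; verify against your real data before committing to it.

-- Hypothetical re-ordering of the sort key: date leads, the rest
-- follow in (assumed) decreasing order of distinct values.
ALTER TABLE test_table
ALTER COMPOUND SORTKEY (date, country, category, subcategory, device, is_free);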

Avoid using INTERLEAVED sort keys for tables that you regularly add data to.
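
One reason is maintenance cost: to keep an interleaved sort effective after ongoing loads you have to run VACUUM REINDEX rather than a plain VACUUM, and that is substantially more expensive on large tables.

-- Required periodically for interleaved tables after significant loads;
-- much slower than a plain VACUUM on big tables.
VACUUM REINDEX test_table;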

Here's an example using the store_sales table from TPC-DS at 100 GB scale: 265 million rows. I converted the ss_sold_date_sk and ss_sold_time_sk surrogate keys to a real timestamp, ss_sold_ts.
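
The conversion isn't shown here, but it could be done along these lines once the target table (defined below) exists. This is only a sketch: it assumes the standard TPC-DS date_dim and time_dim dimension tables are loaded, and that store_sales_ts has ss_sold_ts followed by the original store_sales columns.

-- Build a real timestamp from the TPC-DS date and time surrogate keys.
INSERT INTO store_sales_ts
SELECT DATEADD(second,
               t.t_hour * 3600 + t.t_minute * 60 + t.t_second,
               d.d_date::timestamp) AS ss_sold_ts,
       s.*
FROM store_sales AS s
JOIN date_dim AS d ON s.ss_sold_date_sk = d.d_date_sk
JOIN time_dim AS t ON s.ss_sold_time_sk = t.t_time_sk;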

--   column    | distinct val
-- ss_hdemo_sk |       7,200
-- ss_promo_sk |       1,000
-- ss_store_sk |         201
-- ss_quantity |         100

CREATE TABLE IF NOT EXISTS "store_sales_ts" (…)
DISTSTYLE KEY
DISTKEY ("ss_item_sk")
SORTKEY ("ss_sold_ts"
        ,"ss_hdemo_sk"
        ,"ss_promo_sk"
        ,"ss_store_sk"
        ,"ss_quantity")
;

Timings were taken on a 2-node dc2.large cluster. The result cache is disabled, as shown.

SET enable_result_cache_for_session TO off
;
SELECT ss_store_sk
     , COUNT(*)         AS sales_count
     , AVG(ss_quantity) AS avg_quantity
FROM store_sales_ts
WHERE ss_sold_ts BETWEEN '2001-09-01' AND '2001-09-30'
AND ss_store_sk IN (356,241,160,70)
GROUP BY 1
;
--First run: 5415.869 ms 
--Second run: 1485.217 ms
--Third run: 173.262 ms
--Change month: 337.084 ms

SELECT ss_quantity
     , COUNT(*)         AS sales_count
     , AVG(ss_ext_discount_amt) AS avg_discount_amt
FROM store_sales_ts
WHERE ss_sold_ts BETWEEN '2001-09-01' AND '2001-09-30'
AND ss_quantity > 90
GROUP BY 1
;
--First run: 5717.890 ms
--Second run: 206.465 ms
--Change year: 210.091 ms