I'm creating a table on Amazon Redshift for storing a huge amount of data per day.
I tried to optimize the most I could the database performance using sort keys.
The idea is to be able to serve a wep application with these data through a API performing select queries on it.
After multiple different sort keys, I'm not convinced at all that I'm using the right one. I must be missing / misunderstanding something ...
Table definition :
CREATE TABLE test_table(
date date NOT NULL,
country char(2) NOT NULL,
application_id integer NOT NULL,
device smallint NOT NULL,
category smallint NOT NULL,
subcategory smallint DEFAULT NULL,
rank smallint DEFAULT NULL,
subrank smallint DEFAULT NULL,
is_free smallint NOT NULL,
downloads integer DEFAULT NULL)
distkey(application_id)
Data context :
- Store between 10 000 000 and 20 000 000 lines/ day
- Keep 2 years of history
Sort keys I've already tried :
- Compound sortkey(device, is_free, date, country, category, subcategory)
- Interleaved sortkey(device, is_free, country, category, subcategory)
Performance test executed (on 18 million generated rows) :
With any of these sort keys, example queries below are always performing during 3s and 7s depending number of given countries / categories and date range.
Query examples :
SELECT country, category, sum(downloads)
FROM test_table
WHERE country IN ('fr','jp', 'de', 'gb', 'us')
AND category in (6014, 6005, 6011, 6004, 6017)
AND device = 0
AND is_free = 1
AND date >= '2019-01-01' AND date <= '2019-04-01'
GROUP BY country, category;
SELECT category, country, rank, avg(downloads)
FROM test_table
WHERE country IN ('br','co', 'ru')
AND category in (6009, 6008, 6000)
AND device = 1
AND is_free = 0
AND rank IN (1, 10, 100)
AND date BETWEEN '2019-03-01' AND '2019-04-01'
GROUP BY category, country, rank;
SELECT category, country, application_id, sum(downloads)
FROM test_table
WHERE country IN ('us', 'cn', 'jp')
AND category in (6010, 6003, 6002)
AND device = 1
AND is_free = 1
AND date BETWEEN '2019-04-01' AND '2019-06-01'
GROUP BY category, country, application_id
Is it possible to make it faster ? Are the selected sort keys bad ? Can I put the date field in the interleaved sort key ? (even if I've read this is a bad idea)
If you think Redshift is not made for this case, do you have other database suggestions (I have no restriction on the technology) ?
Thank you in advance for your help :)