I'm working through some numbers for a new Postgres build and wanted some advice on partitioning/sizing, as I have belatedly realised that I'm about to create a 40+ billion row table and then keep adding another 1.5 billion rows per year.
I'm a recent immigrant to Postgres from MSSQL, so I'm still trying to work out what is possible/advisable...
This is the current table design:
```sql
security_id int       NOT NULL, -- 5,000-10,000 securities
ratio_id    smallint  NOT NULL, -- ~100 ratios
period_id   smallint  NOT NULL, -- between 1 and 5 periods
rank_id     smallint  NOT NULL, -- between 1 and 5 different ways to rank
rankvalue   smallint  NOT NULL CHECK (rankvalue BETWEEN 0 AND 101),
validrangez tstzrange NOT NULL  -- 30 years of daily-ish data
```
With the date range, some records don't change for months while others change daily, and timezone matters, which is why I'm using a range. There is a GiST exclusion constraint to avoid overlaps.
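The constraint is roughly along these lines (table and constraint names are made up here; btree_gist is needed to mix the plain = columns with the range overlap):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE ranks
    ADD CONSTRAINT ranks_validrangez_no_overlap
    EXCLUDE USING gist (
        security_id WITH =,
        ratio_id    WITH =,
        period_id   WITH =,
        rank_id     WITH =,
        validrangez WITH &&
    );
```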
Most of the queries will look up a particular date within validrangez and then join with other tables for everything at that date.
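The typical lookup is roughly this shape (made-up names; the range containment operator @> tests whether the date falls inside validrangez and can be served by a GiST index on the range):

```sql
SELECT r.security_id, r.ratio_id, r.period_id, r.rank_id, r.rankvalue
FROM   ranks r
WHERE  r.validrangez @> timestamptz '2024-06-28 00:00:00+00';
```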
I am thinking of partitioning by the year of the upper(validrangez).
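In DDL terms I'm imagining something like this (a sketch with made-up names; if I understand correctly, an expression partition key means the parent table can't have a primary key, so constraints would have to live on the individual partitions):

```sql
CREATE TABLE ranks (
    security_id int       NOT NULL,
    ratio_id    smallint  NOT NULL,
    period_id   smallint  NOT NULL,
    rank_id     smallint  NOT NULL,
    rankvalue   smallint  NOT NULL CHECK (rankvalue BETWEEN 0 AND 101),
    validrangez tstzrange NOT NULL
) PARTITION BY RANGE (upper(validrangez));

CREATE TABLE ranks_2024 PARTITION OF ranks
    FOR VALUES FROM ('2024-01-01 00:00:00+00') TO ('2025-01-01 00:00:00+00');
```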
Question 1. Should I pivot the period_id and rank_id fields out into separate columns?
The upside seems to be that this would turn a 40 billion row table into a 3-4 billion row table, which seems more manageable, as each partition would only be 100-150 million rows rather than a billion. Also, the ids and the range would be shared across the new columns, so the indexes should be smaller.
The downside is that about a third of the new columns would be NULL, i.e. combinations that wouldn't have had rows in the original structure. The joins would also be less normalised. I'm unlikely to add more periods or ranks, but I can't rule it out.
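For concreteness, the pivoted layout I have in mind is roughly this (column names made up, one value column per period/rank combination):

```sql
CREATE TABLE ranks_wide (
    security_id int       NOT NULL,
    ratio_id    smallint  NOT NULL,
    validrangez tstzrange NOT NULL,
    rank_p1_r1  smallint,           -- NULL where the original design had no row
    rank_p1_r2  smallint,
    -- ...and so on, up to 25 combinations...
    rank_p5_r5  smallint
);
```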
Question 2. Should I instead try to create multiple tables?
It's a similar question to the above - basically, should I make writing queries harder (which I do infrequently) in exchange for being able to do joins faster every day?
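By "multiple tables" I mean something like one physical table per period/rank combination, roughly (made-up names):

```sql
CREATE TABLE ranks_period1_rank1 (
    security_id int       NOT NULL,
    ratio_id    smallint  NOT NULL,
    rankvalue   smallint  NOT NULL,
    validrangez tstzrange NOT NULL
);
-- ...one such table per combination; queries that need several combinations
-- would then have to UNION ALL or join the relevant tables explicitly.
```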
Question 3. How much benefit would I get from having rankvalue as a smallint rather than a numeric?
I would prefer to store it as a percentile (between 0 and 1) so that I don't have to keep dividing by 100 when I use it, but thought that across 40 billion records the storage savings would add up. Given rankvalue is not in any indexes, I suspect I have overthought this one...
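For what it's worth, the per-value sizes are easy to compare with the built-in pg_column_size: a smallint is a fixed 2 bytes, while a numeric such as 0.42 is variable length (typically 8 bytes here), before any row-alignment padding:

```sql
SELECT pg_column_size(42::smallint)  AS smallint_bytes,  -- 2
       pg_column_size(0.42::numeric) AS numeric_bytes;   -- typically 8
```

I assume I could always keep the smallint on disk and expose rankvalue / 100.0 through a view if the dividing gets annoying.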
Question 4. Anything else that I might have missed?
Thanks
```sql
SELECT security_id, ratio_id, rankvalue, validrangez,
       array_agg(period_id), array_agg(rank_id)
FROM   table
GROUP  BY security_id, ratio_id, rankvalue, validrangez;
```
Would that also achieve the same number of rows as your solution in Question 1? If yes, then maybe consider aggregating some of the data into array columns. – Łukasz Kamiński