
I'm working some numbers for a new Postgres build and wanted some advice on partitioning/sizing, as I have belatedly realised that I'm about to create a 40+ billion row table and then keep adding another 1.5 billion rows per year.

I'm a recent immigrant to Postgres from MSSQL and so still trying to work out what is possible/advisable...

This is the current table design:

  security_id int NOT NULL,   -- 5,000-10,000 securities
  ratio_id smallint NOT NULL, -- ~100 ratios
  period_id smallint NOT NULL, -- between 1 and 5 periods 
  rank_id smallint NOT NULL,   -- between 1 and 5 different ways to rank
  rankvalue smallint NOT NULL CHECK (rankvalue between 0 and 101),
  validrangez tstzrange NOT NULL -- 30 years of dailyish data. 

With the date range, some records don't change for months while others change daily, and timezone matters, which is why I'm using a range. There is a GiST exclusion constraint to avoid overlaps.
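
Something along these lines (a sketch only - the table name security_ranks is made up, and btree_gist is needed to use the equality operators on the id columns inside a GiST index):

  CREATE EXTENSION IF NOT EXISTS btree_gist;

  ALTER TABLE security_ranks
    ADD CONSTRAINT security_ranks_no_overlap
    EXCLUDE USING gist (
      security_id WITH =,
      ratio_id    WITH =,
      period_id   WITH =,
      rank_id     WITH =,
      validrangez WITH &&   -- no overlapping validity ranges per key
    );

If the table ends up partitioned, this constraint may need to be declared on each partition rather than on the parent, depending on the Postgres version.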

Most of the queries will be looking at a particular date in the validrangez and then joining with other tables for everything at that date.
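
In other words, queries of roughly this shape (a sketch; security_ranks, other_table and the literal date are placeholders):

  SELECT r.security_id, r.ratio_id, r.rankvalue, o.some_column
  FROM   security_ranks r
  JOIN   other_table o ON o.security_id = r.security_id
  WHERE  r.validrangez @> timestamptz '2024-06-30';

where @> tests whether the given date falls inside validrangez.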

I am thinking of partitioning by the year of the upper(validrangez).
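
Roughly like this (a sketch only - the table name is made up, and rows whose range has no upper bound would land in the DEFAULT partition because upper() returns NULL for them):

  CREATE TABLE security_ranks (
    security_id int       NOT NULL,
    ratio_id    smallint  NOT NULL,
    period_id   smallint  NOT NULL,
    rank_id     smallint  NOT NULL,
    rankvalue   smallint  NOT NULL CHECK (rankvalue BETWEEN 0 AND 101),
    validrangez tstzrange NOT NULL
  ) PARTITION BY RANGE (upper(validrangez));

  -- one partition per year of upper(validrangez)
  CREATE TABLE security_ranks_y2024 PARTITION OF security_ranks
    FOR VALUES FROM ('2024-01-01 00:00:00+00') TO ('2025-01-01 00:00:00+00');

  -- catch-all for open-ended ranges (upper() is NULL) and anything else
  CREATE TABLE security_ranks_other PARTITION OF security_ranks DEFAULT;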

Question 1. Should I turn the period_id and rank_id fields into columns?

The upside seems to be that this would turn the table from a 40 billion row table into a 3-4 billion row table, which seems more manageable, as each partition would only be 100-150m rows rather than a billion. Also the ids and the range would be the same across what are now separate rows, so the indexes should be smaller.

The downside is that about a third of the new columns would be NULL (combinations that wouldn't have had rows in the original structure). Also the joins will be less normalised. I'm unlikely to add more periods or ranks, but I can't rule it out.
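
To make the pivoted shape concrete, this is roughly what I mean (a sketch with made-up column names - one rankvalue column per period/rank combination, NULL where there would have been no row):

  CREATE TABLE security_ranks_wide (
    security_id int       NOT NULL,
    ratio_id    smallint  NOT NULL,
    validrangez tstzrange NOT NULL,
    rank_p1_r1  smallint,   -- rankvalue for period 1, ranking method 1
    rank_p1_r2  smallint,   -- rankvalue for period 1, ranking method 2
    -- ... up to 5 periods x 5 ranking methods = 25 columns
    rank_p5_r5  smallint
  );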

Question 2. Should I instead try to create multiple tables?

It's a similar question to the above - basically, should I make writing queries harder (which happens infrequently) in the interest of being able to do joins faster every day?

Question 3. How much benefit would I get from having rankvalue as a smallint rather than a numeric?

I would prefer to store it as a percentile (between 0 and 1) so that I don't have to keep dividing by 100 when I use it, but thought that across 40b records the storage savings would add up. Given that rankvalue is not in any indexes, I suspect I have overthought this one...
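
For what it's worth, the per-value sizes can be compared directly (smallint is a fixed 2 bytes, numeric is a variable-length type of several bytes, before any alignment padding):

  SELECT pg_column_size(42::smallint)  AS smallint_bytes,
         pg_column_size(0.42::numeric) AS numeric_bytes,
         pg_column_size(0.42::real)    AS real_bytes;

Across 40 billion rows, even a few extra bytes per row is on the order of 100 GB or more, so smallint does seem worth keeping.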

Question 4. Anything else that I might have missed?

Thanks

  SELECT security_id, ratio_id, rankvalue, validrangez,
         array_agg(period_id), array_agg(rank_id)
  FROM   table
  GROUP  BY security_id, ratio_id, rankvalue, validrangez

Would that also achieve the same number of rows as your solution in question 1? If yes, then maybe consider aggregating some of the data into array columns. - Łukasz Kamiński

Thanks - I would have to have rankvalue in an array_agg as well, so that would save me storing some (maybe 10-20%) of the data. Every little bit will help, I suspect. Will need to check how much that slows other calculations down. - user1487861

1 Answer


Maybe creating year-wise views would help. Also check the CURSOR option.
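
For example (a sketch only, reusing the hypothetical security_ranks table name from the question side):

  CREATE VIEW security_ranks_2024_v AS
    SELECT *
    FROM   security_ranks
    WHERE  upper(validrangez) >= '2024-01-01 00:00:00+00'
      AND  upper(validrangez) <  '2025-01-01 00:00:00+00';

  -- or fetch a large result incrementally with a cursor
  BEGIN;
  DECLARE ranks_cur CURSOR FOR
    SELECT * FROM security_ranks WHERE validrangez @> timestamptz '2024-06-30';
  FETCH 1000 FROM ranks_cur;
  CLOSE ranks_cur;
  COMMIT;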