166
votes

I'm working on the design for a RoR project for my company, and our development team has already run into a bit of a debate about the design, specifically the database.

We have a model called Message that needs to be persisted. It's a very, very small model with only three db columns other than the id; however, there will likely be A LOT of these models once we go to production. We're looking at as many as 1,000,000 insertions per day. The models will only ever be searched by two foreign keys on them, which can be indexed. As well, the models never have to be deleted, but we also don't have to keep them once they're about three months old.
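For reference, the table would look roughly like this (column names and types are illustrative, not final):

create table messages (
  id           bigserial primary key,
  sender_id    bigint not null,   -- foreign key we search by
  recipient_id bigint not null,   -- foreign key we search by
  created_at   timestamptz not null default now()
);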

So, what we're wondering is whether implementing this table in Postgres will present a significant performance issue. Does anyone have experience with very large SQL databases who can tell us whether or not this will be a problem? And if so, what alternative should we go with?

2
With a good caching layer and a little configuration in PG you should be fine. Tackle performance issues case by case and avoid premature optimization. That said, partitioning and replication are always great options you can take advantage of once you hit bottlenecks. – Sam
Related question here and here. – Erwin Brandstetter
We process about 30 million messages per day in one 5+ TB PostgreSQL database, works fine. – Frank Heikens
FYI, I happened to be reading postgresql.org/about today and noticed that it says that (in principle) the number of rows in a table is unlimited. – Al Chou

2 Answers

138
votes

Rows per table won't be an issue on its own.

So, roughly speaking, 1 million rows a day for 90 days is 90 million rows. Without knowing all the details of what you are doing, I see no reason Postgres can't deal with that.

Depending on your data distribution, you can use a mixture of indexes, partial (filtered) indexes, and table partitioning of some kind to speed things up once you see what performance issues you may or may not have. Your problem will be the same on any other RDBMS that I know of.

If you only need 3 months' worth of data, design in a process to prune off the data you don't need any more; that way you will have a consistent volume of data on the table (a sketch of one partitioning approach follows the limits table below).

You're lucky that you know how much data will exist, so test at your volume and see what you get. Testing one table with 90 million rows may be as easy as:

-- generate 90 million rows of dummy data
select x, 1 as c2, 2 as c3
from generate_series(1, 90000000) x;
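If you'd rather time inserts and index builds against a real table instead of a bare SELECT, here is a minimal sketch along the lines of the schema in the question (all names and fake values are placeholders):

-- materialize 90 million rows of dummy data into a throwaway table
create table messages_test as
select x as id,
       (x % 1000)::bigint as sender_id,     -- fake values for the first lookup key
       (x % 5000)::bigint as recipient_id,  -- fake values for the second lookup key
       now() - (x % 90) * interval '1 day' as created_at
from generate_series(1, 90000000) x;

-- index the two columns the queries will filter on
create index on messages_test (sender_id);
create index on messages_test (recipient_id);

analyze messages_test;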

https://wiki.postgresql.org/wiki/FAQ

Limit   Value
Maximum Database Size       Unlimited
Maximum Table Size          32 TB
Maximum Row Size            1.6 TB
Maximum Field Size          1 GB
Maximum Rows per Table      Unlimited
Maximum Columns per Table   250 - 1600 depending on column types
Maximum Indexes per Table   Unlimited
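To keep only three months of data with a consistent volume, as mentioned above, one option on newer PostgreSQL versions (10+) is declarative range partitioning, so expiring old rows becomes a cheap DROP of a partition rather than a huge DELETE. A rough sketch, reusing the hypothetical schema from the question:

-- parent table partitioned by month
create table messages (
    id           bigserial,
    sender_id    bigint not null,
    recipient_id bigint not null,
    created_at   timestamptz not null default now()
) partition by range (created_at);

-- one partition per month; creating these would normally be automated
create table messages_2024_01 partition of messages
    for values from ('2024-01-01') to ('2024-02-01');

-- retiring a month of expired messages is a quick metadata operation
drop table messages_2024_01;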
80
votes

Another way to speed up your queries significantly on a table with > 100 million rows is to CLUSTER the table, during off hours, on the index that is most often used in your queries. We have a table with > 218 million rows and have found 30X improvements.
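As a rough sketch, assuming the hypothetical messages table and an index on sender_id as the most frequently used one (both names are placeholders):

-- rewrites the whole table in index order and takes an exclusive lock,
-- so schedule it during off hours
cluster messages using messages_sender_id_idx;

-- new rows are not kept in clustered order, so repeat periodically,
-- and refresh planner statistics afterwards
analyze messages;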

Also, for a very large table, it's a good idea to create indexes on your foreign keys.
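For example, for the two lookup columns described in the question (column names are placeholders):

create index on messages (sender_id);
create index on messages (recipient_id);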