3
votes

We are looking into migrating our app to a multi-tenant database. Currently, the app runs with one database per tenant, and there are around 400 tenants. Combined, the largest table would have around 1 billion rows and would grow as tenants are added. Size by tenant varies wildly: one tenant alone has 180 million records in that table, while some have less than a million. A few other tables are in the hundreds of millions; most tables would have much less.

My main concerns revolve around planning for scalability of the large tables, and I'll focus on the largest one. It is a linking/many-to-many table with basic audit fields for created by and created date (though I'm questioning whether those are even necessary for this one). Date/time is not relevant here; this is an assignment table and applies at all times. Records can get deleted or inserted, but not updated, sometimes in bulk, probably not frequently, but it can happen at any time.

Data cardinality would be relatively high on both foreign keys, I think, though I'm not sure what constitutes high cardinality as a ratio of the total number of records. For some perspective, the tenant with 180 million records has around 100,000 distinct values for one foreign key and 165,000 for the other, while another tenant has around 180,000 records, with 500 distinct values in one field and 5,000 in the other. So, as I said, a lot of variability.
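For concreteness, this is roughly the shape of the table I have in mind (PostgreSQL syntax; all names here are just placeholders):

    -- Rough sketch of the assignment table described above; names are placeholders
    CREATE TABLE item_assignment (
        tenant_id   int         NOT NULL,
        parent_id   bigint      NOT NULL,  -- first foreign key
        child_id    bigint      NOT NULL,  -- second foreign key
        created_by  int,
        created_at  timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (tenant_id, parent_id, child_id)
    );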

Would the kind of table I described above (billions of rows, high data cardinality, not time-based, tenant-segmented, bulk inserts/deletes at any time) in the kind of scenario I described (400+ tenants with varying amounts of data) be a good candidate for partitioning? The reason I'm thinking about this now is that I've read in a number of places that partitioning is much less painful if you plan for it ahead of time than if you try to partition later, after the table is huge and hard to restructure without downtime or jumping through hoops.

At this point, my main concern is not so much querying the data; I tested a table with 1 billion records and, with a proper index, select queries run very fast. I'm more worried about concurrency on reads/writes/deletes, running into blocking because of locks, and so on. If partitioning is warranted, what would a good strategy be? Partition by tenant? Partition only the large tenants and keep the smaller ones bundled together? A rough sketch of that last option is below.
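In PostgreSQL terms (again, names are placeholders), I imagine the "partition the big tenants, bundle the rest" option would mean declaring the table partitioned from the start, something like:

    -- Same table as above, but declared LIST-partitioned by tenant (PostgreSQL 11+)
    CREATE TABLE item_assignment (
        tenant_id   int    NOT NULL,
        parent_id   bigint NOT NULL,
        child_id    bigint NOT NULL,
        PRIMARY KEY (tenant_id, parent_id, child_id)
    ) PARTITION BY LIST (tenant_id);

    -- Dedicated partition for the 180-million-row tenant
    CREATE TABLE item_assignment_tenant_42
        PARTITION OF item_assignment FOR VALUES IN (42);

    -- All of the smaller tenants share a default partition
    CREATE TABLE item_assignment_default
        PARTITION OF item_assignment DEFAULT;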


1 Answer

3
votes

Given that you said that query performance is not an issue, the only reason I can think of to consider partitioning is to make mass purging easier to accomplish.

Do you have contractual or legal retention policies in place?

The most common scenario is using time periods as the partition key, so that rolling off old data is simply a matter of dropping partitions. But since you clearly state that date/time is not relevant here, I do not see how that would help.

Is it common for you to roll individual customers on and off? Is there a purging or retention requirement? If so, then partitioning by customer, no matter how imbalanced the partitions would be, would make sense, since you could purge a large customer's data without affecting other customers' access to their data (see the sketch below).
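For example, assuming the table is LIST-partitioned by tenant as sketched in your question (partition names are illustrative), purging one customer becomes roughly the following. Note that in PostgreSQL 12 the DETACH briefly takes an exclusive lock on the parent table; PostgreSQL 14 adds DETACH ... CONCURRENTLY.

    -- Roll off one customer by detaching and then dropping its partition
    ALTER TABLE item_assignment DETACH PARTITION item_assignment_tenant_42;
    DROP TABLE item_assignment_tenant_42;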

As for any concurrency issues, partitioning by customer should help contain these problems within a specific customer that is showing heavy activity.

I recommend testing this thoroughly for a few reasons:

  • I have not seen multiple active partitions in action because I have worked only with time series partitions
  • I have not looked deeply into PostgreSQL 12's foreign key enhancements and wonder whether a foreign key with a partitioned table on both sides would complicate dropping partitions
  • I have never explored the practical limits of the number of partitions a database could contain

I may be reading things from my experience into your question about partitioning, but have you considered a schema per customer?
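A very rough sketch of that approach, with illustrative names:

    -- One schema per customer; the assignment table lives in each customer's schema
    CREATE SCHEMA tenant_42;

    CREATE TABLE tenant_42.item_assignment (
        parent_id bigint NOT NULL,
        child_id  bigint NOT NULL,
        PRIMARY KEY (parent_id, child_id)
    );

    -- The application then points each tenant's connections at its own schema
    SET search_path TO tenant_42;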