I am creating an Azure Cosmos database to store phone call records for multiple customers in a multi-tenant system. The users have the need to search through these phone records by phone number, name, extension number, and date range. The structure for each phone call record is pretty solid (I think) and looks like this:
{
"id": "M7o_ddRB7VRpjUA",
"ownerTime": "2458671015_202008",
"ownerId": "2458671015",
"callTime": "2020-08-30T18:47:44.424+00:00",
"direction": "Outbound",
"action": "VoIP Call",
"result": "Call connected",
"duration": 57,
"hasR": true,
"hasV": false,
"xNums": [
"2605"
],
"xIds": [
"2204328014"
],
"names": [
"sally wayfield"
],
"phoneNums": [
"2098368307",
"2097449211"
],
"emails": [
"[email protected]"
],
"xNums_s": "2605",
"xIds_s": "2204328014",
"phoneNums_s": "2098368307 2097449211",
"names_s": "sally wayfield",
"emails_s": "[email protected]",
"_rid": "QB1nAK5kxPMBAAAAAAAAAA==",
"_self": "dbs/QB1nAA==/colls/QB1nAK5kxPM=/docs/QB1nAK5kxPMBAAAAAAAAAA==/",
"_etag": "\"d2013319-0000-0500-0000-5f4c63dd0000\"",
"_attachments": "attachments/",
"_ts": 1598841821
}
But I have struggled to come up with a good partition key strategy. My initial thought was to keep it simple and designate the account id of each tenant as the partition key. In the document above that account id is the ownerId
. I did some bench-marking on some imported data from our old database, and it worked well enough. However, the volume of call records varies considerably from tenant to tenant. One particular problem I came across is that there were several tenants who are ingesting around 10,000 call records per day. And for a 6-month period, that equated to around 4GB. As we continue to ingest 10,000 call records per day for these accounts, We will hit the limit of 20GB per partition in under 3 years. So this strategy will not work.
My next idea was to include temporal data as part of the partition key. This seems to make sense, as we batch import mass quantities of time-stamped data on an ongoing, daily basis. And, 99% of all queries against this database will include a date range. I came up with a derived, abstract value labeled in the document above as ownerTime
. This value combines the account id with a time factor (year and month of each phone call). Using this scheme, each tenant will get 12 partitions per year instead of a single partition for the whole account. The volume per tenant still varies, so we will still have some large partitions alongside some tiny partitions. But the differences will not be as extreme as when we had one partition per tenant.
Using the "one partition per tenant" strategy, a typical query looks like this:
partitionKey = /ownerId
select c.id,c.callTime,c.direction,c.action,c.result,c.duration,c.hasR,c.hasV,c.callers
from c
where
ownerId='2458671015'
and c.callTime>='2020-01-01T00:00:00 +000'
and c.callTime<='2020-08-31T23:59:59 +000'
and (CONTAINS(c.phoneNums_s, 'rus')
or CONTAINS(c.names_s, 'rus')
or CONTAINS(c.xNums_s, 'rus'))
order by c.callTime desc
Using the ownerTime
partition strategy,the same query looks like this:
partitionKey = /ownerTime
select c.id,c.callTime,c.direction,c.action,c.result,c.duration,c.hasR,c.hasV,c.callers
from c
where array_contains([
'2458671015_202001',
'2458671015_202002',
'2458671015_202003',
'2458671015_202004',
'2458671015_202005',
'2458671015_202006',
'2458671015_202007',
'2458671015_202008'], c.ownerTime)
and c.callTime>='2020-01-01T00:00:00 +000'
and c.callTime<='2020-08-31T23:59:59 +000'
and (CONTAINS(c.phoneNums_s, 'rus')
or CONTAINS(c.names_s, 'rus')
or CONTAINS(c.xNums_s, 'rus'))
order by c.callTime desc
As you can see in the ownerTime
strategy, I am explicitly calculating which partitions need to be searched (based on the search date range). And I am defining those partitions in my query.
In my initial benchmarking tests, I attempted to retrieve 25 records with these queries against one of my high volume accounts (with 1.1 million phone records). The "single partition per tenant" query used around 2200 RUs to bring back a single record. I had to keep using the continuation token to get more results, with each request using around 2000 RUs until I finally got to 25 records. I lost count of the RUs after many iterations. The ownerTime
partition query used around 2500 RUs to retrieve 25 records without having to use the continuation token.
It seems that the ownerTime
strategy is a better choice, as long as all queries include a date range. But I wanted to get some feedback on this strategy. It's not perfect, but it seems to work good. I come from a SQL background, and this is my first time implementing a document database. I am doing it to take advantage of the improved text searching and scalability. Are there "gotcha's" I might encounter in the future using the ownerTime
partition strategy? Based on the information I've provided, is there a partition strategy that might work better?