I have a scenario where I store large amounts of third party data for ad-hoc analysis by business users. Most queries against the data will be complicated, using multiple self-joins, projections, and ranges.
When it comes to picking a PartitionKey for use in Azure DocumentDB, I see people advising to use a logical separator such as TenantId, DeviceId, etc.
Given the parallel nature of DocumentDB, however, I was curious how it would handle a PartitionKey based either on some sort of GUID or large integer so that, during large reads, it would be highly parellelized.
With that in mind, I devised a test with two collections:
test-col-1PartitionKeyis TenantId with roughly 100 possible values
test-col-2PartitionKeyis unique value assigned by third party that follows the pattern "AB1234568". Guaranteed to be globally unique by the third party.
Both collections are set to 100,000 RUs.
In my experiment I loaded both collections with roughly 2,000 documents. Each document is roughly 20 KB in size and is highly denormalized. Each document is an order, which contains multiple jobs, each of which contain users, prices, etc.
Example query:
SELECT
orders.Attributes.OrderNumber,
orders.Attributes.OpenedStamp,
jobs.SubOrderNumber,
jobs.LaborTotal.Amount As LaborTotal,
jobs.LaborActualHours As LaborHours,
jobs.PartsTotal.Amount As PartsTotal,
jobs.JobNumber,
jobs.Tech.Number As TechNumber,
orders.Attributes.OrderPerson.Number As OrderPersonNumber,
jobs.Status
FROM orders
JOIN jobs IN orders.Attributes.Jobs
JOIN tech IN jobs.Techs
WHERE orders.TenantId = @TentantId
AND orders.Attributes.Type = 1
AND orders.Attributes.Status IN (4, 5)";
In my testing I adjusted the following settings:
- Default
ConnectionPolicy - Best practices
ConnectionPolicyConnectionMode.Direct,Protocol.Tcp
- Various
MaxDegreeOfParallelismvalues - Various
MaxBufferedItemCount
The collection with the GUID PartitionKey was queried with EnableCrossPartitionQuery = true. I am using C# and the .NET SDK v1.14.0.
In my initial tests with default settings, I found that querying the collection with TentantId as the PartitionKey was faster, with it taking on average 3,765 ms compared to 4,680 ms on the GUID-keyed collection.
When I set the ConnectionPolicy to Direct with TCP, I discovered TenantID collection query times decreased by nearly 1000 ms to an average of 2,865 ms while the GUID collection increased by about 800 ms to an average of 5,492 ms.
Things started getting interesting when I started playing around with MaxDegreeOfParellelism and MaxBufferedItemCount. The TentantID collection query times were generally unaffected because the query wasn't cross-collection, however the GUID collection sped up considerably, reaching values as fast as 450 ms (MaxDegreeOfParellelism = 2000, MaxBufferedItemCount = 2000).
Given these observations, why would you not want to make the PartitionKey as broad a value as possible?