0
votes

I have a question about DocumentDB partition key choise. I have data with UserId, DeviceId and WhateverId. UserId parameter will be in queries always, so I have chosen UserId as a partition key. But I have a lot of data for one user (millions of entities) and when I made a quety like "SELECT * FROM c WHERE c.DeviceId = @DeviceId" with partition key specified it takes a lot of time(about 6 minutes for about 220 000 returned entities). Maybe it would be more efficient to choose for example DeviceId as a partition key and make queries against a few partitions in parallel (specifying EnableCrossPartitionQuery = true and MaxDegreeOfParallelism = partition count)? Or maybe it is a good idea to use separate collection for every user?

2
Not that this answers your question, but... I think any time you're trying to retrieve a quarter-million entities, you might want to rethink your data access pattern. Also, "SELECT *" is yet another code-smell. I don't see how your choice of partition key is going to make a difference if you're trying to move that much data to your app tier.David Makogon
Thanks. SELECT * was just for quick example, sorry. I'll use SELECT c.Value. And this question is just about choosing partition key, because information on azure documentation site is a little bit abstract as for me. All this measurements is just for performance comparison depending on query.Paval

2 Answers

1
votes

It might help a little but I don't think a partition for each user will solve your problem because you essentially have that under the covers.

You could experiment with the partition key to improve the parrallism but, at best that would give you 2x to 5x improvement in my experience. Is that enough?

For more dramatic improvements you usually have to resort to selective denormalization and/or caching.

0
votes

I know this is a bit old, but for the benefit of others coming to this topic...

From your description I assume that the devices are mostly unique to the user. It is often advised to partition on something like userid which is good if you have, say a call centre application, with many queries for a given userid and want to look up no more than a few hundred entries. In such cases the data can be quickly extracted from a single partition without the overhead of having to collate data across partitions. However, if you have millions of records for the user then partitioning on User Id is perhaps the worst option as extracting large volumes of data from a single partition will soon exceed the overhead of collation. In such cases you want to distribute user data as evenly as possible over all partitions. Unless each user has 25+ devices with similar usage then Device Id is probably not a good choice either.

In cases such as yours, I generally find a system generated incrementing key (e.g. Event Id or Transaction Id) to be the best choice.