1 vote

After reading all the documentation I can find on DocumentDB, I'm still struggling with how best to design the partition key. Take a scenario of corporate emails sent to employees/departments. Assume a massive (fictitious) company with 1 million employees that sends a couple million emails a week, most of which get read and clicked, so we're ingesting large amounts of data.

Let me represent some of the entities as JSON. For all intents and purposes this data (at least in my case) lives in SQL Server, but I'd like to track the opens and clicks by employee and by department. With a large organization this data can grow quickly, which is the use case for DocumentDB. I don't want to debate the merits of DocumentDB for this; I'm just looking to better understand the partition key design. Boiling it down:

Data

  • Newsletter: {newsletterId: 1, name: 'something', departments:[1,2,3]} // this newsletter sent to 3 company departments
  • Employee: {employeeId: 1212, name: 'John Smith'}
  • NewsletterEmployeeActivity: {newsletterId: 1, employeeId: 1212, link: 2, date: '1-2-2017'} and/or {newsletterId: 1, employeeId: 1212, open: '1-2-2017'} // where link is the id of a link on the email

Reports

  • Opens and Clicks by Newsletter by Department
  • Open and Clicks by Department
  • Opens and Clicks for the entire Newsletter
  • Clicks by Link by Newsletter (just assume we can map the link id to the link)

How would you architect the partition key? Would it make sense to have different document types that relate to the reports? I.e., one that tracks employee clicks/opens (this would be a large amount of data), one that increments aggregates by department, an aggregate for the newsletter (maybe just summing the departments), etc. Or will this get expensive to implement, since you might get hit with 10,000 opens at about the same time as an email goes out?

Having read about hot partitions, the approach above would seem to fall into that category.

Are you planning to have DocumentDB perform the aggregations or are you planning to store the aggregate or reports in their final form within DocumentDB? - Denny Lee
Realistically I'm guessing a process will need to run every x period that does the aggregate. I'd store them in documentdb for my example. - lucuma
I'd like to add that if you would like to discuss your partition key choice/design in more detail with a DocumentDB engineer, please email [email protected]. - Aravind Krishna R.
Thank you I will. - lucuma

1 Answer

2 votes

For NewsletterEmployeeActivity, since you expect bursts of activity when a newsletter goes out, employeeId would be a good partition key to effectively scale out write traffic.
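
Here is a minimal sketch of that layout, assuming the current azure-cosmos Python SDK (the successor to the DocumentDB SDK); the account endpoint, key, and database/collection names are placeholders.

    # Minimal sketch: an activity collection partitioned on /employeeId.
    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
    database = client.create_database_if_not_exists(id="newsletters")

    # Writes fan out across employees, so a newsletter blast spreads over many partitions.
    activity = database.create_container_if_not_exists(
        id="NewsletterEmployeeActivity",
        partition_key=PartitionKey(path="/employeeId"),
    )

    # One document per open/click event, shaped like the question's examples.
    activity.create_item(body={
        "id": "1-1212-open-2017-01-02",   # any unique id scheme works
        "newsletterId": 1,
        "employeeId": 1212,               # partition key value
        "open": "1-2-2017",
    })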

Since reports can typically be built with a little delay, I would recommend performing all your aggregates (per newsletter, per newsletter x department, and per department) by running a windowed aggregate over the change feed, then saving the stats per hour/minute into a separate DocumentDB "stats" collection.
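
A rough sketch of that roll-up, with the change feed reading abstracted away (a change feed processor, an Azure Function trigger, or a direct change feed query would all work) and a hypothetical department_of() lookup from employeeId to departmentId:

    # Aggregate a batch of change-feed activity documents into hourly stats
    # per newsletter x department, then upsert them into a "stats" collection.
    from collections import Counter

    def roll_up(activity_docs, department_of):
        counts = Counter()
        for doc in activity_docs:
            kind = "open" if "open" in doc else "click"
            # Bucket key; real code would parse the date and truncate to the hour.
            bucket = doc.get("open") or doc.get("date")
            dept = department_of(doc["employeeId"])
            counts[(doc["newsletterId"], dept, bucket, kind)] += 1
        return counts

    def save_stats(stats_container, counts):
        # One stats document per newsletter x department x bucket x event type.
        # Department- and newsletter-level reports are then cheap sums over these docs.
        for (newsletter_id, dept, bucket, kind), n in counts.items():
            stats_container.upsert_item(body={
                "id": f"{newsletter_id}-{dept}-{bucket}-{kind}",
                "newsletterId": newsletter_id,
                "departmentId": dept,
                "bucket": bucket,
                "eventType": kind,
                "count": n,
            })

Upserting a whole per-window count avoids incrementing a shared counter document 10,000 times during a blast, which is where the contention concern in the question would otherwise bite.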

For storing Newsletter and Employee metadata, you might want to use newsletterId and employeeId respectively. You can store these documents in the same collection if you'd like by creating a synthetic partition key, then storing the appropriate value based on type.
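
A sketch of what that synthetic key could look like, continuing with the same placeholder account as above; the pk and type field names are assumptions, not a prescribed schema:

    # One "metadata" collection partitioned on a generic /pk field, with a
    # type discriminator so Newsletter and Employee documents can coexist.
    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("https://<your-account>.documents.azure.com:443/", credential="<your-key>")
    database = client.create_database_if_not_exists(id="newsletters")

    metadata = database.create_container_if_not_exists(
        id="metadata",
        partition_key=PartitionKey(path="/pk"),
    )

    metadata.upsert_item(body={
        "id": "newsletter-1",
        "pk": "newsletter-1",       # synthetic key: type prefix + natural id
        "type": "newsletter",
        "newsletterId": 1,
        "name": "something",
        "departments": [1, 2, 3],
    })

    metadata.upsert_item(body={
        "id": "employee-1212",
        "pk": "employee-1212",
        "type": "employee",
        "employeeId": 1212,
        "name": "John Smith",
    })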