3 votes

I'm working on a Cassandra data model to store records uploaded by users.

The potential problem is that some users may upload 50-100k rows in a 5-minute period, which can create a "hot spot" on the partition key (user_id). (The DataStax recommendation is to rethink the data model if a partition will hold more than 10k rows.)

How can I avoid having too many records on a partition key in a short amount of time?

I've tried the time-series suggestions from DataStax, but even with year, month, day, and hour columns, a hot spot may still occur.

CREATE TABLE uploads (
    user_id text
   ,rec_id timeuuid
   ,rec_key text
   ,rec_value text
   ,PRIMARY KEY (user_id, rec_id)
);   

The use cases are:

  • Get all upload records by user_id
  • Search for upload records by date range
Can you post an example query that you might try? That will help in determining a useful time bucket. - Aaron
1) select rec_id, rec_key, rec_value from uploads where user_id = 'fred'; - user2966445
2) select rec_id, rec_key, rec_value from uploads where user_id = 'fred' and rec_id >= maxTimeuuid('2015-01-01 00:00+0000') AND rec_id < minTimeuuid('2015-02-01 00:00+0000') - user2966445

1 Answer

7 votes

A few possible ideas:

  1. Use a compound partition key instead of just user_id. The second part of the partition key could be a random number from 1 to n. For example, if n were 5, then your uploads would be spread out over five partitions per user instead of just one. The downside is that when you do reads, you have to repeat them n times to read all the partitions.
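As a sketch of idea 1, assuming n = 5 and a hypothetical bucket column (neither is in the original schema, so adjust the names and the bucket count to taste):

```
-- Illustrative only: "bucket" holds a value 0-4, chosen at random
-- (or by hashing rec_id) at write time, so each user's rows are
-- spread across five partitions instead of one.
CREATE TABLE uploads_bucketed (
    user_id text
   ,bucket int
   ,rec_id timeuuid
   ,rec_key text
   ,rec_value text
   ,PRIMARY KEY ((user_id, bucket), rec_id)
);

-- Reads must fan out over all n buckets; an IN clause on the
-- partition key covers them in a single statement:
SELECT rec_id, rec_key, rec_value
FROM uploads_bucketed
WHERE user_id = 'fred' AND bucket IN (0, 1, 2, 3, 4);
```

The date-range query works the same way, adding the rec_id bounds on top of the bucket fan-out.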

  2. Have a separate table to handle incoming uploads, using rec_id as the partition key. This would spread the load of uploads equally across all the available nodes. Then, to get that data into the table with user_id as the partition key, periodically run a Spark job to extract new uploads and add them to the user_id-based table at a rate that the single partitions can handle.
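A minimal sketch of the staging table for idea 2 (the table name is illustrative):

```
-- Hypothetical staging table: using rec_id as the partition key means
-- each row hashes to its own token, so writes land evenly across the
-- cluster regardless of which user uploaded them.
CREATE TABLE uploads_staging (
    rec_id timeuuid
   ,user_id text
   ,rec_key text
   ,rec_value text
   ,PRIMARY KEY (rec_id)
);
```

The periodic Spark job would scan this table for rows newer than its last checkpoint and insert them into the user_id-partitioned uploads table in a controlled, batched fashion.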

  3. Modify your front end to throttle the rate at which an individual user can upload records. If only a few users are uploading at a high enough rate to cause a problem, it may be easier to limit them rather than modify your whole architecture.