I'm working with Azure Table (storage) in order to store information about websites I'm working with. So, I planned this structure:
- Partition Key - domain name
- Row key - Webpage address
- Valid until (date time) - after this date, the record will be deleted.
- Other crucial data here...
Those columns will be stored in a table called as the website address (e.g. "cnn.com").
I have two main use case (high to low): 1. Check if URL "x" is in the table - find by combination of Partition Key and Row Key - very efficient. 2. Delete old data - remove all expired data (according to "Valid until" column). This operation is taking place every mid-night and possibly delete millions of row - very heavy.
So, our first task (check if URL exists) is implemented in efficient way with this data model. The second task, not. I want to avoid batch deletion.
I also worry about making "hot-spots", which will make me low performance. This because the Partition Key. I expect that in some hours, I will query more question for specific domain. This will make this partition hotspot and hit my performance. In order to avoid this, I thought to use hash-function (on the URL) and the result will be the "partition key". Is this good idea?
I also thought about other implementation way and it's looks like they have some problems:
- Storing the rows in table that named with the deletion date (e.g. "cnn.com-1-1-2016"). This provide us great deleting performance. But, bad searching experience (the row can be exists in more then one table. e.g. "cnn.com-1-1-2016" or "cnn.com-2-1-2016"...).
What is the right solution for my problem?