2 votes

I'm working with Azure Table storage to store information about websites I work with. So, I planned this structure:

  1. Partition Key - domain name
  2. Row Key - webpage address
  3. Valid Until (datetime) - after this date, the record will be deleted.
  4. Other crucial data here...

These columns will be stored in a table named after the website address (e.g. "cnn.com").
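For illustration, here is a minimal sketch of that entity shape using the azure-data-tables Python SDK. The connection string and table name are placeholders, the table is assumed to already exist, and note that Azure table names must be alphanumeric while keys may not contain '/', '\', '#' or '?':

```python
from datetime import datetime, timedelta, timezone

from azure.data.tables import TableClient

# Placeholder connection string (local Azurite / storage emulator).
conn_str = "UseDevelopmentStorage=true"

# Table names must be alphanumeric, so "cnn.com" becomes e.g. "cnncom".
client = TableClient.from_connection_string(conn_str, table_name="cnncom")

client.create_entity({
    "PartitionKey": "cnn.com",             # 1. domain name
    "RowKey": "politics|some-story.html",  # 2. page address ('/' replaced by '|')
    "ValidUntil": datetime.now(timezone.utc) + timedelta(days=30),  # 3. expiry
    # 4. other crucial data here...
})
```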

I have two main use cases (from high to low frequency):

  1. Check whether URL "x" is in the table - a lookup by the combination of Partition Key and Row Key - very efficient.
  2. Delete old data - remove all expired rows (according to the "Valid Until" column). This operation takes place every midnight and may delete millions of rows - very heavy.

So, the first task (checking whether a URL exists) is implemented efficiently with this data model; the second is not. I want to avoid that massive batch deletion.
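Use case 1 then becomes a single-entity read. Reusing the client from the sketch above:

```python
from azure.core.exceptions import ResourceNotFoundError

def url_exists(domain: str, page: str) -> bool:
    """Point lookup by PartitionKey + RowKey - one request, no scan."""
    try:
        client.get_entity(partition_key=domain, row_key=page)
        return True
    except ResourceNotFoundError:
        return False
```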

I also worry about creating "hot spots", which would give me low performance, because of the Partition Key: I expect that during some hours I will issue many more queries for a specific domain, which will make that partition a hot spot and hurt my performance. To avoid this, I thought of applying a hash function to the URL and using the result as the Partition Key. Is this a good idea?
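A sketch of that hashing idea (the bucket count is an arbitrary choice of mine). The point lookup of use case 1 still works, because the Partition Key can be recomputed from the URL alone; what you give up is the ability to scan all pages of one domain with a single partition query:

```python
import hashlib

def hashed_partition_key(url: str, buckets: int = 256) -> str:
    """Map a URL to one of `buckets` partitions via a stable hash."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return f"{int.from_bytes(digest[:4], 'big') % buckets:03d}"

# The (key-safe encoded) URL itself becomes the Row Key.
pk = hashed_partition_key("https://cnn.com/politics/some-story.html")
```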

I also thought about other ways to implement this, and it looks like they have some problems:

  • Storing the rows in tables named with the deletion date (e.g. "cnn.com-1-1-2016"). This gives us great deletion performance (drop the whole table), but a bad search experience: the row can exist in more than one table, e.g. "cnn.com-1-1-2016" or "cnn.com-2-1-2016"... (see the sketch after this list).
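To make that search drawback concrete: under the date-named-tables layout, a lookup has to probe one table per possible deletion date. A sketch, assuming the same Python SDK, alphanumeric table names, and a known maximum TTL:

```python
from datetime import date, timedelta

from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("UseDevelopmentStorage=true")

def url_exists_any_day(domain: str, page: str, max_ttl_days: int = 30) -> bool:
    """Probe every table the row could live in - one per deletion date."""
    today = date.today()
    for offset in range(max_ttl_days + 1):
        day = today + timedelta(days=offset)
        # `domain` must already be alphanumeric here, e.g. "cnncom".
        table = service.get_table_client(f"{domain}{day:%Y%m%d}")
        try:
            table.get_entity(partition_key=domain, row_key=page)
            return True
        except ResourceNotFoundError:
            continue  # not in this table (or the table does not exist)
    return False
```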

What is the right solution for my problem?

There's no "right" answer - this is a broad topic with lots of possible solutions (adding extra tables for deletion references; using an alternate database for the searchable data you need to delete; etc.). As for "hot spots", that will be up to you to benchmark (each partition provides up to 2,000 transactions/sec). - David Makogon

2 Answers

0 votes

Have you seen the Azure Table Storage Design Guide? It describes principles and patterns for designing table solutions at scale. For hot spots, take a look at the prepend/append anti-pattern for some extra information. This is where all your operations occur within a single partition, which prevents additional resources from being added. For these types of scenarios you will get better scale if you can distribute the operations across partitions instead.
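As a tiny illustration of the append anti-pattern (my own example, not from the guide): a monotonically growing Partition Key sends every write to the same "latest" partition, while a small spreading prefix fans the load out:

```python
import random
from datetime import datetime, timezone

# Append anti-pattern: every writer hits the same "latest" partition.
pk_hot = datetime.now(timezone.utc).strftime("%Y%m%d%H")

# Spreading variant: a random (or hash-derived) prefix fans the same
# traffic out over 16 partitions.
pk_spread = f"{random.randrange(16):x}-{pk_hot}"
```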

0 votes
  1. Let's assume you have a page https://www.yahoo.com/news/death-omar-al-shishani-could-mean-war-against-203132664.html?nhp=1. You can build the PK as the domain name + "/news/" + the first 2 letters of the page address - in short, https://www.yahoo.com/news/de. The RK is the remaining part of the full address. This will split your domain partition into nearly 1,000 partitions. If that's not enough, use the first 3 letters in the PK. (See the first sketch after this list.)

  2. Remove obsolete data every 15 minutes (create a separate service for it). Your millions will become just tens of thousands. Or keep less data (e.g. 2 weeks instead of a month). And do not forget to optimize deletion: fetch the PK and RK only, set the ETag to "*", remove as a DynamicTableEntity, and batch if possible. (See the second sketch below.)
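A minimal sketch of the key-splitting in point 1, in Python. The URL parsing is standard; the '/' and '?' substitutions are my addition, since those characters are not allowed in Azure Table keys:

```python
from urllib.parse import urlsplit

def _sanitize(s: str) -> str:
    # Azure Table keys may not contain '/', '\', '#' or '?'.
    return s.replace("/", "|").replace("?", "!")

def split_keys(url: str, prefix_len: int = 2) -> tuple[str, str]:
    """Derive (PartitionKey, RowKey) as in point 1: the PK carries the
    domain, the first path segment and the first letters of the page
    name, so one busy domain fans out over ~1000 partitions."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")               # "news/death-omar-..."
    section, _, page = path.partition("/")      # "news", "death-omar-..."
    if parts.query:
        page += "?" + parts.query
    pk = f"{parts.netloc}/{section}/{page[:prefix_len]}"   # ".../news/de"
    rk = page[prefix_len:]
    return _sanitize(pk), _sanitize(rk)

pk, rk = split_keys("https://www.yahoo.com/news/"
                    "death-omar-al-shishani-could-mean-war-against-203132664.html?nhp=1")
# pk == "www.yahoo.com|news|de"
```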
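And a sketch of the optimized purge in point 2. The ETag "*" and DynamicTableEntity details in the answer come from the .NET SDK; the equivalent idea with the Python azure-data-tables package (connection string and table name are placeholders) is to select only the key columns, group by partition, and delete in transactions of up to 100 entities:

```python
from collections import defaultdict
from datetime import datetime, timezone

from azure.data.tables import TableClient

def purge_expired(client: TableClient, cutoff: datetime) -> None:
    """Delete every entity whose ValidUntil lies before `cutoff`."""
    # Fetch only PartitionKey and RowKey - deletion needs nothing else.
    expired = client.query_entities(
        "ValidUntil lt @cutoff",
        parameters={"cutoff": cutoff},
        select=["PartitionKey", "RowKey"],
    )
    # A transaction must target one partition and hold at most 100 operations.
    by_pk = defaultdict(list)
    for entity in expired:
        by_pk[entity["PartitionKey"]].append(entity)
    for entities in by_pk.values():
        for i in range(0, len(entities), 100):
            client.submit_transaction([("delete", e) for e in entities[i:i + 100]])

# Run from a scheduled worker every 15 minutes:
client = TableClient.from_connection_string("UseDevelopmentStorage=true",
                                            table_name="cnncom")
purge_expired(client, datetime.now(timezone.utc))
```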