How to design data storage for partitioned tagging system?

Question

How to design data storage for huge tagging system (like digg or delicious)?

There is already discussion about it, but it is about centralized database. Since the data is supposed to grow, we'll need to partition the data into multiple shards soon or later. So, the question turns to be: How to design data storage for partitioned tagging system?

The tagging system basically has 3 tables:

Item (item_id, item_content)

Tag (tag_id, tag_title)

TagMapping(map_id, tag_id, item_id)

That works fine for finding all items for given tag and finding all tags for given item, if the table is stored in one database instance. If we need to partition the data into multiple database instances, it is not that easy.

For table Item, we can partition its content with its key item_id. For table Tag, we can partition its content with its key tag_id. For example, we want to partition table Tag into K databases. We can simply choose number (tag_id % K) database to store given tag.

But, how to partition table TagMapping?

The TagMapping table represents the many-to-many relationship. I can only image to have duplication. That is, same content of TagMappping has two copies. One is partitioned with tag_id and the other is partitioned with item_id. In scenario to find tags for given item, we use partition with tag_id. If scenario to find items for given tag, we use partition with item_id.

As a result, there is data redundancy. And, the application level should keep the consistency of all tables. It looks hard.

Is there any better solution to solve this many-to-many partition problem?

Michael Petito Michael Petito · Accepted Answer · 2010-04-24T04:47:07

I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!

Finding Tags for a Given Item

A1. You're going to display all of the tags for a given item at once

A2. You're going to ensure that all of an item's tags are unique

Finding Items for a Given Tag

B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)

B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags

B3. You're going to sort the items for a given tag (or tags) by some measure of popularity

Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).

Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records needed to fill a page of results. How many partitions you will have to query will depend on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.

The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.

How to design data storage for partitioned tagging system?

3 Answers