182
votes

How would you design a database to support the following tagging features:

  • items can have a large number of tags
  • searches for all items that are tagged with a given set of tags must be quick (the items must have ALL tags, so it's an AND-search, not an OR-search)
  • creating/writing items may be slower to enable quick lookup/reading

Ideally, the lookup of all items that are tagged with (at least) a set of n given tags should be done using a single SQL statement. Since the number of tags to search for as well as the number of tags on any item are unknown and may be high, using JOINs is impractical.

Any ideas?


Thanks for all the answers so far.

If I'm not mistaken, however, the given answers show how to do an OR-search on tags. (Select all items that have one or more of n tags). I am looking for an efficient AND-search. (Select all items that have ALL n tags - and possibly more.)

12
from scaling perspective you have to use a minor-major tag technique else performance will degrade when lots of major tags come forward (and they will do). Especially if you use a de-normalized schema this will happen much faster - for RDBMS solutions alwaysStefanos Zilellis

12 Answers

82
votes

Here's a good article on tagging Database schemas:

http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/

along with performance tests:

http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/

Note that the conclusions there are very specific to MySQL, which (at least in 2005 at the time that was written) had very poor full text indexing characteristics.

24
votes

About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in concise and yet comprehendible way.

About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", like digiguru suggests: It sounds like a complicated situation whenever new tags are added(?) But some DBMSes (including Oracle) offer bitmap indexes which may somehow be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them in a proper when when performing the query plan.

14
votes

I just wanted to highlight that the article that @Jeff Atwood links to (http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/) is very thorough (It discusses the merits of 3 different schema approaches) and has a good solution for the AND queries that will usually perform better than what has been mentioned here so far (i.e. it doesn't use a correlated subquery for each term). Also lots of good stuff in the comments.

ps - The approach that everyone is talking about here is referred to as the "Toxi" solution in the article.

13
votes

I don't see a problem with a straightforward solution: Table for items, table for tags, crosstable for "tagging"

Indices on cross table should be enough optimisation. Selecting appropriate items would be

SELECT * FROM items WHERE id IN  
    (SELECT DISTINCT item_id FROM item_tag WHERE  
    tag_id = tag1 OR tag_id = tag2 OR ...)  

AND tagging would be

SELECT * FROM items WHERE  
    EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag1)  
    AND EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag2)  
    AND ...

which is admittedly, not so efficient for large number of comparing tags. If you are to maintain tag count in memory, you could make query to start with tags that are not often, so AND sequence would be evaluated quicker. Depending on expected number of tags to be matched against and expectancy of matching any single of them this could be OK solution, if you are to match 20 tags, and expect that some random item will match 15 of them, then this would still be heavy on a database.

7
votes

You might want to experiment with a not-strictly-database solution like a Java Content Repository implementation (e.g. Apache Jackrabbit) and use a search engine built on top of that like Apache Lucene.

This solution with the appropriate caching mechanisms would possibly yield better performance than a home-grown solution.

However, I don't really think that in a small or medium-sized application you would require a more sophisticated implementation than the normalized database mentioned in earlier posts.

EDIT: with your clarification it seems more compelling to use a JCR-like solution with a search engine. That would greatly simplify your programs in the long run.

5
votes

The easiest method is to create a tags table.
Target_Type -- in case you are tagging multiple tables
Target -- The key to the record being tagged
Tag -- The text of a tag

Querying the data would be something like:

Select distinct target from tags   
where tag in ([your list of tags to search for here])  
and target_type = [the table you're searching]

UPDATE
Based on your requirement to AND the conditions, the query above would turn into something like this

select target
from (
  select target, count(*) cnt 
  from tags   
  where tag in ([your list of tags to search for here])
    and target_type = [the table you're searching]
)
where cnt = [number of tags being searched]
1
votes

I'd second @Zizzencs suggestion that you might want something that's not totally (R)DB-centric

Somehow, I believe that using plain nvarchar fields to store that tags with some proper caching/indexing might yield faster results. But that's just me.

I've implemented tagging systems using 3 tables to represent a Many-to-Many relationship before (Item Tags ItemTags), but I suppose you will be dealing with tags in a lot of places, I can tell you that with 3 tables having to be manipulated/queried simultaneously all the time will definitely make your code more complex.

You might want to consider if the added complexity is worth it.

0
votes

You won't be able to avoid joins and still be somewhat normalized.

My approach is to have a Tag Table.

 TagId (PK)| TagName (Indexed)

Then, you have a TagXREFID column in your items table.

This TagXREFID column is a FK to a 3rd table, I'll call it TagXREF:

 TagXrefID | ItemID | TagId

So, to get all tags for an item would be something like:

SELECT Tags.TagId,Tags.TagName 
     FROM Tags,TagXref 
     WHERE TagXref.TagId = Tags.TagId 
         AND TagXref.ItemID = @ItemID

And to get all items for a tag, I'd use something like this:

SELECT * FROM Items, TagXref
     WHERE TagXref.TagId IN 
          ( SELECT Tags.TagId FROM Tags
                WHERE Tags.TagName = @TagName; )
     AND Items.ItemId = TagXref.ItemId;

To AND a bunch of tags together, You would to modify the above statement slightly to add AND Tags.TagName = @TagName1 AND Tags.TagName = @TagName2 etc...and dynamically build the query.

0
votes

What I like to do is have a number of tables that represent the raw data, so in this case you'd have

Items (ID pk, Name, <properties>)
Tags (ID pk, Name)
TagItems (TagID fk, ItemID fk)

This works fast for the write times, and keeps everything normalized, but you may also note that for each tag, you'll need to join tables twice for every further tag you want to AND, so it's got slow read.

A solution to improve read is to create a caching table on command by setting up a stored procedure that essentially creates new table that represents the data in a flattened format...

CachedTagItems(ID, Name, <properties>, tag1, tag2, ... tagN)

Then you can consider how often the Tagged Item table needs to be kept up to date, if it's on every insert, then call the stored procedure in a cursor insert event. If it's an hourly task, then set up an hourly job to run it.

Now to get really clever in data retrieval, you'll want to create a stored procedure to get data from the tags. Rather than using nested queries in a massive case statement, you want to pass in a single parameter containing a list of tags you want to select from the database, and return a record set of Items. This would be best in binary format, using bitwise operators.

In binary format, it is easy to explain. Let's say there are four tags to be assigned to an item, in binary we could represent that

0000

If all four tags are assigned to an object, the object would look like this...

1111

If just the first two...

1100

Then it's just a case of finding the binary values with the 1s and zeros in the column you want. Using SQL Server's Bitwise operators, you can check that there is a 1 in the first of the columns using very simple queries.

Check this link to find out more.

0
votes

To paraphrase what others have said: the trick isn't in the schema, it's in the query.

The naive schema of Entities/Labels/Tags is the right way to go. But as you've seen, it's not immediately clear how to perform an AND query with a lot of tags.

The best way to optimize that query will be platform-dependent, so I would recommend re-tagging your question with your RDBS and changing the title to something like "Optimal way to perform AND query on a tagging database".

I have a few suggestions for MS SQL, but will refrain in case that's not the platform you're using.

0
votes

A variation to the above answer is take the tag ids, sort them, combine as a ^ separated string and hash them. Then simply associate the hash to the item. Each combination of tags produces a new key. To do an AND search simply re-create the hash with the given tag ids and search. Changing tags on an item will cause the hash to be recreated. Items with the same set of tags share the same hash key.

0
votes

If you've an array type, you can pre-aggregate the needed data. See this answer in a separate thread:

what's the utility of array type?