2
votes

Is there a common way of dealing with tags in databases?

I'm thinking of using tinytext with pipes. I think adding another table and using IDs might make it more complicated for little gain.

What's your preferred way of doing this?

and what is the right way of doing queries in a table to find results matching multiple or single tags?

Thanks

3
You may also consider JSON encoding your tags.Alix Axel

3 Answers

8
votes

Implement a simple N:N relations.

Items:
-id
-name
-title
-foo
-bar

ItemTagRelations:
-Fkey itemId
-Fkey tagId

Tags:
-id
-name
-etc

2
votes

I'll spread little heresy here.

Big boys, including this site are using denormalised schemas for tags for scalability reasons, storing comma, pipe or space delimited tags in text type field for each row and separate table for tags with counts. Upon inserting or updating an item just check what tags were added or dropped and update counts accordingly (explode to arrays old and new tag strings and do array_diff() ).

Now you have cheap way to display tag cloud with counts by simple SELECT * FROM tags, no fancy queries. To find items tagged with given name just do LIKE '%TAG%', this will work well for small traffic website (say less then 100k page views per day) and small data sets (again, say less than 100k of records). Above that you could use Fulltext Search to speed things up and ultimately proper search engine like Lucene or Sphinx.

Finding related tags, like here on SO, is easy too (Kohana specific code, LIKE based, MySQL specific):

$tags = array('foo', 'bar');

private function get_related_tags( $tags )
{
    ## Get db entries with specific tags and build array with counts

    ## is it cached already? ------------------------------------------------
    $this->cache = Cache::instance();
    $tags = array_filter( array_flip(array_flip($tags)) );
    sort($tags);
    $cache_name = implode('', $tags);
    $cache = $this->cache->get( $cache_name );

    if( $cache )
     return $cache;

    ## not cached, fire up ---------------------------------------------------

    $db = Database::instance();

    ## count tagged items ----------------------------------------------------

    // build like string
    $like = array();
    foreach( $tags as $tag )
       $like[] = "tags LIKE '%$tag%'";

    $like = implode(' AND ', $like);

    // get counts
    $count = $db->query("SELECT count(id) AS count FROM `articles` WHERE $like")->current()->count;

    ## check what tags are related ------------------------------------------

    $offset = 0;
    $step = 300;

    $related_tags = array();

    while( $offset < $count )
    {
        $assets = $db->query("SELECT tags FROM `articles` WHERE $like ORDER BY id ASC LIMIT $step OFFSET $offset");

        foreach($assets as $asset)
        {
            // tags 
            $input = explode( ' ', trim($asset->tags) );
            foreach( $input as $k => $v )
            {
                 if( $v == ''){
                     //do nothing, shouldnt be here anyway
                 }
                 elseif( array_key_exists($v, $related_tags) ){
                     $related_tags[$v]++;
                 }
                 else{
                    $related_tags[$v] = 1;
                 }        
            }
        }
        $offset += $step;
    }

    // remove already displayed from list
    foreach( $tags as $tag )
        unset( $related_tags[$tag] );

    ksort($related_tags);

    // set cache 
    $this->cache->set( $cache_name, array($related_tags, $count), 'related_tags_counts', 0);

    return array($related_tags, $count);
}

This is not really cheap so I keep counts cached for given set of tags until I make changes to tags in articles table.

This setup is not perfect by any means, but certainly has some advantages. Schema is simple, getting tag cloud is straightforward, getting articles along with tags with one simple query (ie without subqueries). As main disadvantages I would see inability to rename or drop tag system-wide without amending every single row where it occurs, but hey, how often you do that anyway?

Currently I'm using this setup for few projects of mine and it works like a dream, but I must admit these are not high traffic websites (hence I get away with LIKE), next year I will be able to test it with busy site but I'm pretty sure it will do. Normalization nazis will vote me down perhaps, but I just love simplicity of it and I'm happy to trade off cpu cycles for that.

Actually I was going to post this tag system a while ago on SO and ask experts what they think of it so feel free to leave comments.

Traditionally, sorry for my English, I believe it's funny =)

EDIT

Since you've provided your requiremnents in comments, I think this setup is perfect for you. I've posted full Tag Model in pastie here, with methods to handle counts, Kohana specific but if you know Codeigniter you'll feel home. Just use it this way:

table TAGS: id, tag_name, tag_count

// insert new item/article
$tag_model->update_tags( $tags_str, null );

// update existing item 
$tag_model->update_tags( $new_tags_str, $old_tags_str ); // $old_tags as stored in db

// delete item, you'll have to get item from db before deletion
$tag_model->update_tags( null, $old_tags_str ); 

I've amended the code as markdown have mangled it up, also queries are mySQL flavour, not SQLite.

0
votes

Yes, don't scare away from normalization and having each tag in its own record. This will ultimately be the most flexible and with the correct indexing the fastest.