2
votes

Tag-based web sites often suffer from the delicacy of language such as synonyms, homonyms, etc. For programmers looking for information, say on Stack Overflow, concrete examples are:

  • Subversion or SVN (or svn, with case-sensitive tags)
  • .NET or Mono
  • [Will add more]

The problem is that we do want to preserve our delicacy of language and make the machine deal with it as good as possible.

A site like del.icio.us sees its tag base grow a lot, thus probably hindering usage or search. Searching for SVN-related entries will probably list a majority of entries with both subversion and svn tags, but I can think of three issues:

  1. A search is incomplete as many entries may not have both tags (which are 'synonyms').
  2. A search is less useful as Q/A often lead to more Qs! Notably for newbies on a given topic.
  3. Tagging a question (note: or an answer separately, sounds useful) becomes philosophical: 'Did I Tag the Right Way?'

One way to address these issues is to create semantic links between tags, so that subversion and SVN are automatically bound by the system, not by poor users.

Is it an approach that sounds good/feasible/attractive/useful? How to implement it efficiently?

7
This is old. I am going to vote for closing it. Would someone with enough credentials move it to Meta? It seems a better fit over there, even though it is already old.Eric Platon
I don't know that it's really a Meta.SO q .. maybe a Programmers.SE fit?warren
Even if it was a Meta question, it's now too old to migrate.Mr Lister

7 Answers

3
votes

Recognizing synonyms and semantic connections is something that humans are good at; a solution to organizing an open-ended taxonomy like what SO is featuring would probably be well served by finding a way to leave the matching to humans.

One general approach: someone (or some team) reviews new tags on a daily basis. New synonyms are added to synonym groups. Searches hit synonym groups (or, more nuanced, hit either literal matches or synonym group matches according to user preference).

This requires support for synonym groups on the back end (work for the dev team). It requires a tag wrangler or ten (work for the principals or for trusted users). It doesn't require constant scaling, though—the rate at which the total tag pool grows will likely (after the initial Here Comes Everybody bump of the open beta) will in all likelihood decrease over time, as any organic lexicon's growth-rate does.

Synonymy strikes me as the go-to issue. Hierarchical mapping is an ambitious and more complicated issue; it may be worth it or it may not be, but given the relative complexity of defining the hierarchy it'd probably be better left as a Phase 2 to any potential synonym project's Phase 1.

1
votes

The way the software on blogspot.com is set up, is that there is an ajax-autocomplete-thingie on the box where you write the name of the tags. This searches all your previous posts for tags that start with the same letters. At least that way you catch different casings and spellings (but not synonyms).

1
votes

How would the system know which tags to semantically link? Would it keep an ever-growing map of tags? I can't see that working. What if someone typed sbversion instead? How would that get linked?

I think that asking the user when they submit tags could work. For example, "You've entered the following tags: sbversion, pascal and bindings. Did you mean, "Subversion", "Pascal" and "Bindings"?

Obviously the system would have to have a fairly smart matching system for that to work. Doing it this way would be extra input for the user (which'd probably annoy them) but the human input would, if done correctly, make for less duplicate tags.

In fact, having said all that, the system could use the results of the user's input as a basis for automatic tag matching. From the previous example, someone creates a tag of "sbversion" and when prompted changes it to "Subversion" - the system could learn that and do it automatically next time.

1
votes

Part of the issue you're looking at is that English is rife with synonyms - are the following different: build-management, subversion, cvs, source-control?

Maybe, maybe not. Having a system, like the one [now] in use on SO that brings up the tag you probably meant is extremely helpful. But it doesn't stop people from bulling-through the tagging process.

Maybe you could refuse to accept "new" tags without a user-interaction? Before you let 'sbversion' go in, force a spelling check?

This is definitely an interesting problem. I asked an open question similar to this on my blog last year. A couple of the responses were quite insightful.

1
votes

I completely agree. The mass of tags that have currently. I don't participate in other tagged based sites. However having a hierarchy of tags would be very helpful, instead of etc...

0
votes

Tags are basically our admission that search algorithms aren't up to snuff. If we can get a computer to be smart enough to identify that things tagged "Subversion" have similar content to things tagged "svn", presumably we can parse the contents, so why not skip tags altogether, and match a search term directly to the content (i.e., autotagging, which is basically mapping keywords to results)?!

0
votes

The problem is to make the search engine use the fact that 'subversion' and 'svn' are very similar to the point that they mean the same 'thing'.

It might be attractive to compute a simple similarity between tags based on frequency: 'subversion' and 'svn' appear very often together, so requesting 'svn' would return SVN-related questions, but also the rare questions only tagged 'subversion' (and vice versa). However, 'java' and 'c#' also appear often together, but for very different reasons (they are not synonyms). So similarity based on frequency is out.

An answer to this problem might be a mix of mechanisms, as the ones suggested in this Q/A thread:

  • Filtering out typos by suggesting tags when the user inputs them.
  • Maintaining a user-generated map of synonyms. This map may not be that big if it just targets synonyms.
  • Allowing multi-tag search, such that the user can put 'subversion svn' or 'subversion && svn' (well, from programmers to programmers) in the search box and get both. This would be quite practical as many users may actually try such approach when they do not know which term is the most meaningful.

@Nick: Agreed. The question is not meant to argue against tags. Tags have great potential, but users will face a growing issue if one cannot search 'across' tags.

@Steve: Maintaining an ever-growing map of tags is definitely not practical. As SO is accumulating an ever-growing bag of tags, how could we shade some light on this bag to make search of Q/A tags even more useful, in a convenient way?

@Espo: 'Ajax-powered' tag suggestions based on existing tags is apparently available on SO when creating a question. This is by the way very helpful to choose tags and appropriate spelling (avoiding the 'subversion' vs. 'sbversion' issue from Steve).