A StringToken Parser which gives Google Search style "Did you mean:" Suggestions

5

votes

Seeking a method to:

Take whitespace separated tokens in a String; return a suggested Word

ie:
Google Search can take "fonetic wrd nterpreterr",
and atop of the result page it shows "Did you mean: phonetic word interpreter"

A solution in any of the C* languages or Java would be preferred.

Are there any existing Open Libraries which perform such functionality?

Or is there a way to Utilise a Google API to request a suggested word?

language-agnosticparsingnlp

Is the issue the checking of spellings or checking of alternate similar spellings suggesting based on your own data? – Tristan Warner-Smith

11

votes

In his article How to Write a Spelling Corrector, Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:

The full details of an industrial-strength spell corrector like Google's would be more confusing than enlightening, but I figured that on the plane flight home, in less than a page of code, I could write a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.

Using Norvig's code and this text as training set, i get the following results:

>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']

2

votes

You can use the yahoo web service here: http://developer.yahoo.com/search/web/V1/spellingSuggestion.html

However it's only a web service... (i.e. there are no APIs for other language etc..) but it outputs JSON or XML, so... pretty easy to adapt to any language...

2

votes

You can also use the Google API's to spell check. There is an ASP implementation here (I'm not to credit for this, though).

2

votes

First off:

Use the one of your choice. I suspect it runs the query against a spell-checking engine with a word limit of exactly one, it then does nothing if the entire query is valid, otherwise it replaces each word with that word's best match. In other words, the following algorithm (an empty return string means that the query had no problems):

startup()
{
   set the spelling engines word suggestion limit to 1
}

option 1()
{
   int currentPosition = engine.NextWord(start the search at word 0, querystring);

   if(currentPosition == -1)
      return empty string; // Query is a-ok.

   while(currentPosition != -1)
   {
       queryString = engine.ReplaceWord(engine.CurrentWord, queryString, the suggestion with index 0);
       currentPosition = engine.NextWord(currentPosition, querystring);
   }

   return queryString;
}

2

votes

Since no one has yet mentioned it, I'll give one more phrase to search for: "edit distance" (for example, link text). That can be used to find closest matches, assuming it's typos where letters are transposed, missing or added.

But usually this is also coupled with some sort of relevancy information; either by simple popularity (to assume most commonly used close-enough match is most likely correct word), or by contextual likelihood (words that follow preceding correct word, or come before one). This gets into information retrieval; one way to start is to look at bigram and trigrams (sequences of words seen together). Google has very extensive freely available data sets for these.

For simple initial solution though a dictionary couple with Levenshtein-based matchers works surprisingly well.

2

votes

You could plug Lucene, which has a dictionary facility implementing the Levenshtein distance method.

Here's an example from the Wiki, where 2 is the distance.

String[] l=spellChecker.suggestSimilar("sevanty", 2);
//l[0] = "seventy"

1

votes

The Google SOAP Search APIs do that.

1

votes

If you have a dictionary stored as a trie, there is a fairly straightforward way to find best-matching entries, where characters can be inserted, deleted, or replaced.

void match(trie t, char* w, string s, int budget){
  if (budget < 0) return;
  if (*w=='\0') print s;
  foreach (char c, subtrie t1 in t){
    /* try matching or replacing c */
    match(t1, w+1, s+c, (*w==c ? budget : budget-1));
    /* try deleting c */
    match(t1, w, s, budget-1);
  }
  /* try inserting *w */
  match(t, w+1, s + *w, budget-1);
}

The idea is that first you call it with a budget of zero, and see if it prints anything out. Then try a budget of 1, and so on, until it prints out some matches. The bigger the budget the longer it takes. You might want to only go up to a budget of 2.

Added: It's not too hard to extend this to handle common prefixes and suffixes. For example, English prefixes like "un", "anti" and "dis" can be in the dictionary, and can then link back to the top of the dictionary. For suffixes like "ism", "'s", and "ed" there can be a separate trie containing just the suffixes, and most words can link to that suffix trie. Then it can handle strange words like "antinationalizationalization".

A StringToken Parser which gives Google Search style "Did you mean:" Suggestions

Seeking a method to:

Take whitespace separated tokens in a String; return a suggested Word

8 Answers