26
votes

I'm using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I'm making, so it needs to work well. The problem is that the algorithm doesn't consider the following strings similar:

CONAIR
AIRCON

The algorithm gives a distance of 6. So for these 6-letter words (you normalize by the length of the longer word), the difference is 100% => the similarity is 0%.

I need a way to measure the similarity between two strings that also takes cases like the one above into consideration.

Is there a better algorithm I can use? What do you recommend?

EDIT: I've also looked into the Damerau–Levenshtein algorithm, which adds transpositions. The problem is that these transpositions only apply to adjacent characters, not to blocks of several characters.
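To make the numbers concrete, here is a small Python sketch of what I mean (a plain DP implementation; the edit_distance name is just mine):

def edit_distance(a, b, transpositions=False):
    # Plain Levenshtein by default; with transpositions=True this is
    # the "optimal string alignment" variant of Damerau-Levenshtein,
    # which only handles swaps of *adjacent* characters.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(edit_distance("CONAIR", "AIRCON"))                       # 6 -> 0% similarity
print(edit_distance("CONAIR", "AIRCON", transpositions=True))  # still 6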

7
Before you can figure out a string distance algorithm, you need to define clearly what sort of transformations you think are acceptable. What makes these strings more similar to each other than two random 6-letter strings? Can you express it in a manner such that you can 'hill climb' from one string to the other, getting more similar each step? – Nick Johnson

7 Answers

11
votes

I would divide the term into unigrams, bigrams and trigrams, then calculate cosine similarity.
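A minimal sketch of that idea in Python (standard library only; the helper names are mine), treating each string as a bag of character 1-, 2- and 3-grams and taking the cosine of the two count vectors:

from collections import Counter
from math import sqrt

def ngrams(s, nmax=3):
    # Counts of all character unigrams, bigrams and trigrams of s.
    grams = Counter()
    for n in range(1, nmax + 1):
        for i in range(len(s) - n + 1):
            grams[s[i:i + n]] += 1
    return grams

def cosine_similarity(s1, s2):
    a, b = ngrams(s1), ngrams(s2)
    dot = sum(a[g] * b[g] for g in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("conair", "aircon"))  # 0.8

The shared bigrams and trigrams (co, on, ai, ir, con, air) capture the rotated blocks that character-level Levenshtein misses, so "conair" and "aircon" score 0.8 instead of 0.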

5
votes

I think this can be easily solved by employing the Longest Common Subsequence (LCS) algorithm on one of the strings (e.g. "conair") and the other string appended to itself once (e.g. "aircon" -> "airconaircon").

Sample code in C:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Returns the length of the longest common subsequence (LCS)
// between two given strings.
//
// This recursive implementation can be replaced by a
// more performant dynamic programming implementation.
size_t llcs(const char* s1, const char* s2)
{
  size_t len[3];

  if (*s1 == '\0' || *s2 == '\0') return 0;

  len[0] = (*s1 == *s2) + llcs(s1 + 1, s2 + 1);
  len[1] = llcs(s1 + 1, s2);
  len[2] = llcs(s1, s2 + 1);

  if (len[0] < len[1]) len[0] = len[1];
  if (len[0] < len[2]) len[0] = len[2];

  return len[0];
}

// Returns similarity of two given strings in the range
// from 0.0 to 1.0 (1.0 for equal strings).
double similarity(const char* s1, const char* s2)
{
  size_t s1len = strlen(s1);
  size_t s2len = strlen(s2);
  double sim;

  if (s1len == 0 && s2len == 0)
  {
    // Two empty strings are equal
    sim = 1;
  }
  else
  {
    size_t len;
    // Append s1 to itself in s1s1 (e.g. "aircon" -> "airconaircon")
    char* s1s1 = malloc(s1len * 2 + 1);
    if (s1s1 == NULL) return 0; // treat allocation failure as 0.0 similarity
    strcpy(s1s1, s1);
    strcpy(s1s1 + s1len, s1);

    // Find the length of the LCS between s1s1 and s2
    // (e.g. between "airconaircon" and "conair")
    len = llcs(s1s1, s2);
    // We need it not longer than s1 (e.g. "aircon")
    // since we're actually comparing s1 and s2
    if (len > s1len) len = s1len;

    // Count the matched characters in both strings (Dice-style measure)
    len *= 2;

    // Prevent 100% similarity between a string and its
    // cyclically shifted version (e.g. "aircon" and "conair")
    if (len == s1len + s2len && strcmp(s1, s2) != 0) len--;

    // Get the final measure of the similarity
    sim = (double)len / (s1len + s2len);

    free(s1s1);
  }

  return sim;
}

int main(int argc, char** argv)
{
  if (argc == 3)
    printf("Similarity of \"%s\" and \"%s\" is %.2f%%\n",
           argv[1], argv[2], 100 * similarity(argv[1], argv[2]));
  else
    printf("Usage:\n  %s string1 string2\n",
           argv[0]);
  return 0;
}

Sample output:

Similarity of "123" and "123" is 100.00%
Similarity of "123" and "1234" is 85.71%
Similarity of "0123" and "123" is 85.71%
Similarity of "a" and "aa" is 66.67%
Similarity of "aa" and "a" is 66.67%
Similarity of "aaaaaaa" and "aaaaaa" is 92.31%
Similarity of "aaaaaa" and "aaaaaaa" is 92.31%
Similarity of "aircon" and "conair" is 91.67%
Similarity of "spit" and "pits" is 87.50%
Similarity of "pits" and "spit" is 87.50%
Similarity of "spits" and "pits" is 88.89%
Similarity of "pits" and "spits" is 88.89%
2
votes

It sounds like you may want to try doing Levenshtein distance using syllables or phonemes instead of letters.
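For example, a Python sketch of Levenshtein over token sequences (the DP is standard; the syllable lists below are hand-tokenized stand-ins, since a real implementation would use a hyphenation or phonetic library):

def levenshtein(a, b):
    # Standard DP; a and b can be any sequences, so lists of
    # syllable or phoneme tokens work just like character strings.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

print(levenshtein(list("conair"), list("aircon")))  # 6 edits over 6 characters
print(levenshtein(["con", "air"], ["air", "con"]))  # 2 edits over 2 tokens

At syllable granularity the swap also becomes an adjacent transposition, so the Damerau–Levenshtein variant from the question's EDIT would count it as a single edit.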

2
votes

Theoretically, the approach you are using is correct for the problem you are trying to solve. But Levenshtein considers only the individual characters of the two strings.

The string similarity could also be found using the Longest Common Subsequence method, and then you can apply Levenshtein to the rest of the unmatched characters.
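As one possible concrete reading of this (a sketch, not a tuned implementation): difflib's SequenceMatcher stands in for an LCS routine, and python-Levenshtein (used in another answer here) supplies the edit distance; the function name is illustrative.

from difflib import SequenceMatcher
from Levenshtein import distance

def lcs_then_levenshtein(a, b):
    # Credit the characters covered by common matching blocks,
    # then charge Levenshtein edits only for the leftovers.
    matched_a, matched_b = set(), set()
    for block in SequenceMatcher(None, a, b).get_matching_blocks():
        matched_a.update(range(block.a, block.a + block.size))
        matched_b.update(range(block.b, block.b + block.size))
    rest_a = "".join(c for i, c in enumerate(a) if i not in matched_a)
    rest_b = "".join(c for i, c in enumerate(b) if i not in matched_b)
    edits = distance(rest_a, rest_b)
    return 1 - edits / max(len(a), len(b), 1)

print(lcs_then_levenshtein("conair", "aircon"))  # 1.0: "con" matches as a
# block and the leftovers "air"/"air" need no edits; note this rates a
# cyclic shift as fully similar, which you may want to penalize.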

In case you want to take a clustering approach, the following answer seems to have some details, but it is obviously more difficult to implement.

2
votes

Sorting the letters of each word and then computing the Levenshtein distance would give a 100% match for your example, but it would also give a 100% match for, e.g.,

CONAIR
RCIAON

which might not be what you want.
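A quick demonstration of that caveat, using python-Levenshtein (imported in another answer here) on the sorted letters:

from Levenshtein import distance

def sorted_similarity(a, b):
    # Compare the alphabetically sorted letters of both words.
    sa, sb = "".join(sorted(a)), "".join(sorted(b))
    return 1 - distance(sa, sb) / max(len(a), len(b))

print(sorted_similarity("CONAIR", "AIRCON"))  # 1.0
print(sorted_similarity("CONAIR", "RCIAON"))  # 1.0 too: any anagram matches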

Another way to define similarity is to find the common substrings of the two strings. You can build a suffix tree, find all common substrings, and determine from them how similar the strings are. For your example, the suffix tree gives the common substrings CON and AIR, which together cover the whole word (for your two strings), thus concluding that they are similar.
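Here is a sketch of that idea in Python; for brevity it uses a quadratic DP over match runs instead of a real suffix tree, which would find the same substrings more efficiently:

def common_substrings(a, b, min_len=2):
    # Collects the common substrings of length >= min_len (shorter
    # prefixes of longer runs included); a generalized suffix tree
    # does the same job in closer to linear time.
    found = set()
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                if run[i][j] >= min_len:
                    found.add(a[i - run[i][j]:i])
    return found

print(common_substrings("CONAIR", "AIRCON"))  # {'CO', 'CON', 'AI', 'AIR'}

CON and AIR together cover all six letters of both words, which is the coverage argument made above.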

2
votes

Try using other similarity measures like Sørensen-Dice, Jaccard, and Jaro-Winkler.

Personally, I am a huge fan of Jaro-Winkler, since it has served my purpose many times.

In [1]: from Levenshtein import jaro_winkler
In [2]: jaro_winkler("conair", "aircon")
Out[2]: 0.8333333333333334

1
vote

Take a look at the Needleman-Wunsch or Smith-Waterman algorithms. They are used for string matching via an adapted edit distance on DNA sequences, where insertions, reversals, and transpositions of any length may happen at any place. Having said this, I should add that for a sufficiently long string there is no optimal solution. And don't forget that the edit costs depend on the use context of the algorithm (a semantics issue), while any algorithm is always a syntactical machine.
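A rough Python sketch of the global-alignment idea (the scores are arbitrary example values; Smith-Waterman differs mainly in clamping cell scores at zero so that the best local region wins):

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # Classic global-alignment DP: maximizes an alignment score
    # instead of minimizing an edit distance.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i * gap
    for j in range(1, len(b) + 1):
        d[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            d[i][j] = max(d[i - 1][j - 1] + s,   # (mis)match
                          d[i - 1][j] + gap,     # gap in b
                          d[i][j - 1] + gap)     # gap in a
    return d[len(a)][len(b)]

print(needleman_wunsch("CONAIR", "AIRCON"))  # -3: AIR aligns, CON is paid
                                             # for with gaps on both sides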