2
votes

I'm trying to create a damerau-levenshtein distance function in JS.

I've found a description off the algorithm on WIkipedia, but they is no implementation off it. It says:

To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: O\left (M \cdot N \cdot \max(M, N) \right ), where M and N are string lengths. Using the ideas of Lowrance and Wagner,[7] this naive algorithm can be improved to be O\left (M \cdot N \right) in the worst case. It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of[1] for an example of such an adaptation.

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

The section [1] points to http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf which is even more complicated to me.

If I understood this correctly, it's not that easy to create an implementation off it.

Here's the levenshtein implementation I currently use :

levenshtein=function (s1, s2) {
  //       discuss at: http://phpjs.org/functions/levenshtein/
  //      original by: Carlos R. L. Rodrigues (http://www.jsfromhell.com)
  //      bugfixed by: Onno Marsman
  //       revised by: Andrea Giammarchi (http://webreflection.blogspot.com)
  // reimplemented by: Brett Zamir (http://brett-zamir.me)
  // reimplemented by: Alexander M Beedie
  //        example 1: levenshtein('Kevin van Zonneveld', 'Kevin van Sommeveld');
  //        returns 1: 3

  if (s1 == s2) {
    return 0;
  }

  var s1_len = s1.length;
  var s2_len = s2.length;
  if (s1_len === 0) {
    return s2_len;
  }
  if (s2_len === 0) {
    return s1_len;
  }

  // BEGIN STATIC
  var split = false;
  try {
    split = !('0')[0];
  } catch (e) {
    // Earlier IE may not support access by string index
    split = true;
  }
  // END STATIC
  if (split) {
    s1 = s1.split('');
    s2 = s2.split('');
  }

  var v0 = new Array(s1_len + 1);
  var v1 = new Array(s1_len + 1);

  var s1_idx = 0,
    s2_idx = 0,
    cost = 0;
  for (s1_idx = 0; s1_idx < s1_len + 1; s1_idx++) {
    v0[s1_idx] = s1_idx;
  }
  var char_s1 = '',
    char_s2 = '';
  for (s2_idx = 1; s2_idx <= s2_len; s2_idx++) {
    v1[0] = s2_idx;
    char_s2 = s2[s2_idx - 1];

    for (s1_idx = 0; s1_idx < s1_len; s1_idx++) {
      char_s1 = s1[s1_idx];
      cost = (char_s1 == char_s2) ? 0 : 1;
      var m_min = v0[s1_idx + 1] + 1;
      var b = v1[s1_idx] + 1;
      var c = v0[s1_idx] + cost;
      if (b < m_min) {
        m_min = b;
      }
      if (c < m_min) {
        m_min = c;
      }
      v1[s1_idx + 1] = m_min;
    }
    var v_tmp = v0;
    v0 = v1;
    v1 = v_tmp;
  }
  return v0[s1_len];
} 

What are your ideas for building such an algorithm and, if you think it would be too complicated, what could I do to make no difference between 'l' (L lowercase) and 'I' (i uppercase) for example.

1
What do you want to do exactly? You're referring to (so-called) Damerau-Levenshtein distance, but your piece of code contains the classic Levenshtein algorithm. If you only need transposition support, only a few lines of code need to be changed. As for the "no difference" thing between given characters, you must assign given penalties for specific edit operations. This can be dealt with table lookup/insertion, but is likely to be too slow in javascript to be usable for anything.michaelmeyer
Yes, my code is using yet only the simple Levenshtein algorithm. I have a database of screenNames (300 screenNames) , and a OCR scanner that scans a list of screenNames (300 screenNames). But the OCR scanner gives bad results. So I would like to find similarities (that's what I'm currently doing in JS). For example "mikejew_e" is interpreted as "mikeiew e". I'm using the levenshtein algorithm right now (with a max distance of 3), but it is a bit too permissive. (with a distance of 2, I risk loosing some of the matched screenNames)edi9999
Ok, I get it. The basic step to get better results is to assign specific weights to each edit operation. Make the default penalty 1.0, and lower the penalty for the characters your OCR program is likely to misread. For 300 names, this will be fast enough, even in javascript. I just posted a gist of my C implementation: gist.github.com/doukremt/9473228. This may give you an idea of how this is done. I also remember there is pure javascript implementation on github, but can't find the link anymore, sorry!michaelmeyer
Very nice, I will convert that to JS. Exactly what I was looking foredi9999

1 Answers

3
votes

The gist @doukremt gave: https://gist.github.com/doukremt/9473228

gives the following in Javascript.

You can change the weights of operations in the weighter object.

var levenshteinWeighted= function(seq1,seq2)
{
    var len1=seq1.length;
    var len2=seq2.length;
    var i, j;
    var dist;
    var ic, dc, rc;
    var last, old, column;

    var weighter={
        insert:function(c) { return 1.; },
        delete:function(c) { return 0.5; },
        replace:function(c, d) { return 0.3; }
    };

    /* don't swap the sequences, or this is gonna be painful */
    if (len1 == 0 || len2 == 0) {
        dist = 0;
        while (len1)
            dist += weighter.delete(seq1[--len1]);
        while (len2)
            dist += weighter.insert(seq2[--len2]);
        return dist;
    }

    column = []; // malloc((len2 + 1) * sizeof(double));
    //if (!column) return -1;

    column[0] = 0;
    for (j = 1; j <= len2; ++j)
        column[j] = column[j - 1] + weighter.insert(seq2[j - 1]);

    for (i = 1; i <= len1; ++i) {
        last = column[0]; /* m[i-1][0] */
        column[0] += weighter.delete(seq1[i - 1]); /* m[i][0] */
        for (j = 1; j <= len2; ++j) {
            old = column[j];
            if (seq1[i - 1] == seq2[j - 1]) {
                column[j] = last; /* m[i-1][j-1] */
            } else {
                ic = column[j - 1] + weighter.insert(seq2[j - 1]);      /* m[i][j-1] */
                dc = column[j] + weighter.delete(seq1[i - 1]);          /* m[i-1][j] */
                rc = last + weighter.replace(seq1[i - 1], seq2[j - 1]); /* m[i-1][j-1] */
                column[j] = ic < dc ? ic : (dc < rc ? dc : rc);
            }
            last = old;
        }
    }

    dist = column[len2];
    return dist;
}