Why is the Levenshtein distance score so low for these two strings?

Question

I am using a Levenshtein distance algorithm to find similar strings and I currently have my score for acceptance as 12 (because some of my strings have up to 5 words). But I was suprised to see the below two strings get a score of 11, they seem very different to me..

def string1 = "Facial Fuel"
def string2 = "Calendula Toner"

println("Result is ${distance(string1,string2)}");

def distance(String str1, String str2) {
   def dist = new int[str1.size() + 1][str2.size() + 1]
   (0..str1.size()).each { dist[it][0] = it }
   (0..str2.size()).each { dist[0][it] = it }

    (1..str1.size()).each { i ->
    (1..str2.size()).each { j ->
        dist[i][j] = [dist[i - 1][j] + 1, dist[i][j - 1] + 1, dist[i - 1][j - 1] + ((str1[i - 1] == str2[j - 1]) ? 0 : 1)].min()
       }
   }
   return dist[str1.size()][str2.size()]
}

Is something wrong with the algorithm here or is that the right score?

Do you have any unit tests for known distances? If not, that's the first thing you need to do (with any software development). For starters, take the Wikipedia page on this topic and use their examples: en.wikipedia.org/wiki/Levenshtein_distance — Erwin Bolwidt

Unknown Unknown · Accepted Answer · 2016-01-23T04:25:15

Your result is right.

    string1:   F a c i a     l   . F u   e l
    operation: S C S S S I I C I C S S I C S
    string2:   C a l e n d u l a . T o n e r

('.' means ' ')

Operations are

    C : Copy
    S : Substitue
    I : Insert
    D : Delete

Levenshtein distance is

    S * 7 + I * 4 = 11

Why is the Levenshtein distance score so low for these two strings?

1 Answers