4
votes

Suppose you have two lists of strings containing similar items, with changes (eg. List 1: Apples,fruits_b,orange; List2: Fruit,apples,banana,orange_juice).

Given a distance metric such as the Levenshtein distance, what are good algorithms for finding the optimal pairing, that is the pairing that minimizes the sum of distances for all pairings?

The result corresponding to my example would be:

Apples    - apples
fruits_b  - Fruit
orange    - orange_juice
          - banana

Subsidiary question: is there some tool that already implements this or something similar?

1
So you already have all the distances, and you want an algorithm that will compute a matching with the lowest total? Would the Hungarian algorithm work for you?svinja
@svinja: yes, I suppose that would work. Although avoiding computing all the distances would be even better.static_rtti
Fun fact: The Levenshtein distance between "orange" and "banana" is 5, between "orange" and "orange juice" it is 6. (I think the Levenshtein distance is only meaningful when it has small values, smaller then the word length at least.)M Oehm
If the lists are not the same size, for each "missing" word you just add a column or row to the matrix with all distances equal (let's say 0). So you're pretending you have, for example, an extra word in the first list, that has equal distance to all the words in the second list.svinja
Since "Editing distance" already is spatial metaphor: Couldn't you use some kind of nearest-neighbour search where the "coordinates" are word metrics? (These metrics would have to work on the word itself so you can place the words in your word "space", though, and I can't come up with anything sensible besides word length.)M Oehm

1 Answers

5
votes

OK, here's my python solution using the Levenshtein distance and the Hungarian algorithm (both provided by external packages):

from munkres import Munkres
from Levenshtein import distance
from sys import argv

if __name__ == '__main__':
    if len(argv) < 3:
        print("Usage: fuzzy_match.py file file")
        print("Finds the best pairing of lines from the two input files")
        print("using the Levenshtein distance and the Hungarian algorithm")
    w1 = [l.strip() for l in open(argv[1]).readlines()]
    w2 = [l.strip() for l in open(argv[2]).readlines()]
    if len(w1) != len(w2):
        if len(w2) > len(w1):
            w1, w2 = w2, w1
        w2.extend([""]*(len(w1)-len(w2)))
    matrix = []
    for i in w1: 
        row = []
        for j in w2:
            row.append(distance(i.lower(), j.lower()))
        matrix.append(row)
    m = Munkres()
    max_length = max(len(w) for w in w1)
    for i, j in m.compute(matrix):
        print(("{:<%d}{}" % (max_length+10)).format(w1[i], w2[j]))

It works pretty nicely. I'm still curious if anyone can come up with a better algorithm, though!