53
votes

For the problem I'm working on, finding distances between two sequences to determine their similarity, sequence order is very important. However, my sequences are not all the same length, so I pad the shorter one with empty points until both are the same length, which satisfies the Hamming distance requirement. Is there any major problem with doing this, given that all I care about is the number of transpositions (not insertions or deletions, which Levenshtein also counts)?

I've found that Hamming distance is much, much faster than Levenshtein as a distance metric for longer sequences. When should one use Levenshtein distance (or one of its derivatives) instead of the much cheaper Hamming distance? For sequences of equal length, Hamming distance is an upper bound on the Levenshtein distance, so if I am comparing two sequences for an order-biased similarity metric rather than the absolute minimal number of moves needed to match them, there isn't an apparent reason to choose Levenshtein over Hamming, is there?

2
When you say that "all you care about is the number of transpositions", what do you want to do with the "overhanging" segment when one sequence is longer? Hamming distance will add the difference in length to the total distance. - j_random_hacker
By that, you do mean that '123' to '12 ' and '123' to '124' would have the same distance, correct? If so, yes, that's what I want. - don
In that case I think you answered your own original question :) - j_random_hacker
I'm voting to close this question as off-topic because it should belong to CS SE. - nbro

2 Answers

46
votes

That question really depends on the types of sequences you are matching, and what result you want.

If it's not a problem that "1234567890" and "0123456789" are considered totally different, then Hamming distance is indeed fine.
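To make that concrete, here is a minimal sketch comparing the two metrics on exactly that pair. The helper names `hamming` and `levenshtein` are my own for illustration, and the Levenshtein routine is the textbook dynamic-programming version, not an optimized one.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(hamming("1234567890", "0123456789"))      # 10: every position differs
print(levenshtein("1234567890", "0123456789"))  # 2: drop the trailing 0, insert a leading 0
```

A single shift makes Hamming report the maximum possible distance, while Levenshtein sees the two strings as nearly identical. The price is that this Levenshtein sketch takes time proportional to the product of the two lengths, whereas Hamming is linear.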

6
votes

In addition to Johan's correct answer, the padding itself can be problematic.

For example, when you compare 123 to 123456, the result depends on whether you pad at the end of the string or at the start. The similarity of ___123 with 123456 is 0, but the similarity of 123___ with 123456 is 3.
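A small sketch of that effect, assuming "similarity" here means the number of positions that match and that `_` is the padding symbol (both are my reading of the example above; `matches` is a hypothetical helper name):

```python
def matches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b))


PAD = "_"  # assumed padding symbol; it must never occur in real data
short, full = "123", "123456"

left_padded = short.rjust(len(full), PAD)   # "___123"
right_padded = short.ljust(len(full), PAD)  # "123___"

print(left_padded, matches(left_padded, full))    # ___123 0
print(right_padded, matches(right_padded, full))  # 123___ 3
```

The corresponding Hamming distances are 6 and 3, so whichever side you pad needs to be chosen deliberately and applied consistently across all comparisons.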