Comparing Strings using Levenshtein Distance
by Christoph Schiessl on Python
Levenshtein distance is an easy-to-understand algorithm that measures the similarity of a pair of strings. Imagine that you have received one input string, x
, and your job is to find the string most similar to x
from a list of candidates. To do this, you need a "distance" function to determine if one candidate is more or less similar to x
than some other candidate. This is precisely what Levenshtein distance does. It's certainly not the only such algorithm, but it is very popular because it is easy to implement. The idea behind the algorithm can be explained as a four-step process:
- If
a
is blank (i.e.,len(a) == 0
), then its distance tob
islen(b)
. - If
b
is blank (i.e.,len(b) == 0
), then its distance toa
islen(a)
. - If
a
andb
have a common first character (i.e.,a[0] == b[0]
), then their distance is equal to the distance betweena[1:]
andb[1:]
(ignore first character). Consequently, the distance is0
, if the two strings are equal (i.e.,a == b
). - Otherwise, their distance is
1
plus the minimum of the following three options:- The distance between
a[1:]
andb
(delete the first character ofa
). - The distance between
a
andb[1:]
(delete the first character ofb
). - The distance between
a[1:]
andb[1:]
(the first character is replaced by the first character of the other string).
- The distance between
This was the hard part — the implementation is the easy part because we can simply translate the four steps above 1:1 into Python code:
def levenshtein_distance(a: str, b: str) -> int:
# (1) Early exit if `a` is blank ...
if len(a) == 0:
return len(b)
# (2) Early exit if `b` is blank ...
if len(b) == 0:
return len(a)
# (3) Ignore the common first character of `a` and `b` ...
if a[0] == b[0]:
return levenshtein_distance(a[1:], b[1:])
# (4) Otherwise, `1` plus the minimum of the following ...
return 1 + min(
levenshtein_distance(a[1:], b), # Delete first char of `a`
levenshtein_distance(a, b[1:]), # Delete first char of `b`
levenshtein_distance(a[1:], b[1:]), # Replace first character
)
In case it is not yet clear, Levenshtein distance is counting the number of characters that have to be deleted/inserted/replaced to transform the string a
into the string b
.
When you implement an algorithm like this, it's always a good idea to first write a few test-cases to ensure that your implementation is actually correct. For the sake of completeness, here are the test-cases that I have created for myself:
assert levenshtein_distance("", "") == 0 # equal
assert levenshtein_distance("abcxyz", "abcxyz") == 0 # equal
assert levenshtein_distance("", "abcxyz") == 6 # nothing
assert levenshtein_distance("abcxyz", "") == 6 # nothing
assert levenshtein_distance("a", "abcxyz") == 5 # beginning
assert levenshtein_distance("ab", "abcxyz") == 4 # beginning
assert levenshtein_distance("cx", "abcxyz") == 4 # middle
assert levenshtein_distance("bcxy", "abcxyz") == 2 # middle
assert levenshtein_distance("yz", "abcxyz") == 4 # ending
assert levenshtein_distance("z", "abcxyz") == 5 # ending
assert levenshtein_distance("az", "abcxyz") == 4 # fractured
assert levenshtein_distance("acxz", "abcxyz") == 2 # fractured
assert levenshtein_distance("bcy", "abcxyz") == 3 # fractured
assert levenshtein_distance("123", "abcxyz") == 6 # different chars
assert levenshtein_distance("ab123", "abcxyz") == 4 # different chars
assert levenshtein_distance("1a2c3z", "abcxyz") == 4 # different chars
As you can see, I decided against using a testing framework like pytest
, which would have been overkill, and instead used simple assert
statements. Anyway, thank you for reading, and see you soon!