Comparing Strings using Levenshtein Distance

by Christoph Schiessl on [Python]

Levenshtein distance is an easy-to-understand algorithm that measures the similarity of a pair of strings. Imagine that you have received one input string, x, and your job is to find the string most similar to x from a list of candidates. To do this, you need a "distance" function to determine if one candidate is more or less similar to x than some other candidate. This is precisely what Levenshtein distance does. It's certainly not the only such algorithm, but it is very popular because it is easy to implement. The idea behind the algorithm can be explained as a four-step process:

  1. If a is blank (i.e., len(a) == 0), then its distance to b is len(b).
  2. If b is blank (i.e., len(b) == 0), then its distance to a is len(a).
  3. If a and b have a common first character (i.e., a[0] == b[0]), then their distance is equal to the distance between a[1:] and b[1:] (ignore first character). Consequently, the distance is 0, if the two strings are equal (i.e., a == b).
  4. Otherwise, their distance is 1 plus the minimum of the following three options:
    • The distance between a[1:] and b (delete the first character of a).
    • The distance between a and b[1:] (delete the first character of b).
    • The distance between a[1:] and b[1:] (the first character is replaced by the first character of the other string).

This was the hard part — the implementation is the easy part because we can simply translate the four steps above 1:1 into Python code:

def levenshtein_distance(a: str, b: str) -> int:
    # (1) Early exit if `a` is blank ...
    if len(a) == 0: return len(b)

    # (2) Early exit if `b` is blank ...
    if len(b) == 0: return len(a)

    # (3) Ignore the common first character of `a` and `b` ...
    if a[0] == b[0]: return levenshtein_distance(a[1:], b[1:])

    # (4) Otherwise, `1` plus the minimum of the following ...
    return 1 + min(
        levenshtein_distance(a[1:], b),      # Delete first char of `a`
        levenshtein_distance(a, b[1:]),      # Delete first char of `b`
        levenshtein_distance(a[1:], b[1:]),  # Replace first character

In case it is not yet clear, Levenshtein distance is counting the number of characters that have to be deleted/inserted/replaced to transform the string a into the string b.

When you implement an algorithm like this, it's always a good idea to first write a few test-cases to ensure that your implementation is actually correct. For the sake of completeness, here are the test-cases that I have created for myself:

assert levenshtein_distance("", "") == 0              # equal
assert levenshtein_distance("abcxyz", "abcxyz") == 0  # equal

assert levenshtein_distance("", "abcxyz") == 6        # nothing
assert levenshtein_distance("abcxyz", "") == 6        # nothing

assert levenshtein_distance("a", "abcxyz") == 5       # beginning
assert levenshtein_distance("ab", "abcxyz") == 4      # beginning

assert levenshtein_distance("cx", "abcxyz") == 4      # middle
assert levenshtein_distance("bcxy", "abcxyz") == 2    # middle

assert levenshtein_distance("yz", "abcxyz") == 4      # ending
assert levenshtein_distance("z", "abcxyz") == 5       # ending

assert levenshtein_distance("az", "abcxyz") == 4      # fractured
assert levenshtein_distance("acxz", "abcxyz") == 2    # fractured
assert levenshtein_distance("bcy", "abcxyz") == 3     # fractured

assert levenshtein_distance("123", "abcxyz") == 6     # different chars
assert levenshtein_distance("ab123", "abcxyz") == 4   # different chars
assert levenshtein_distance("1a2c3z", "abcxyz") == 4  # different chars

As you can see, I decided against using a testing framework like pytest, which would have been overkill, and instead used simple assert statements. Anyway, thank you for reading, and see you soon!