Comparing Strings using Levenshtein Distance

by Christoph Schiessl on Python

Levenshtein distance is an easy-to-understand algorithm that measures the similarity of a pair of strings. Imagine that you have received one input string, x, and your job is to find the string most similar to x from a list of candidates. To do this, you need a "distance" function to determine if one candidate is more or less similar to x than some other candidate. This is precisely what Levenshtein distance does. It's certainly not the only such algorithm, but it is very popular because it is easy to implement. The idea behind the algorithm can be explained as a four-step process:

  1. If a is blank (i.e., len(a) == 0), then its distance to b is len(b).
  2. If b is blank (i.e., len(b) == 0), then its distance to a is len(a).
  3. If a and b have a common first character (i.e., a[0] == b[0]), then their distance is equal to the distance between a[1:] and b[1:] (ignore first character). Consequently, the distance is 0, if the two strings are equal (i.e., a == b).
  4. Otherwise, their distance is 1 plus the minimum of the following three options:
    • The distance between a[1:] and b (delete the first character of a).
    • The distance between a and b[1:] (delete the first character of b).
    • The distance between a[1:] and b[1:] (the first character is replaced by the first character of the other string).

This was the hard part — the implementation is the easy part because we can simply translate the four steps above 1:1 into Python code:

def levenshtein_distance(a: str, b: str) -> int:
    # (1) Early exit if `a` is blank ...
    if len(a) == 0:
        return len(b)

    # (2) Early exit if `b` is blank ...
    if len(b) == 0:
        return len(a)

    # (3) Ignore the common first character of `a` and `b` ...
    if a[0] == b[0]:
        return levenshtein_distance(a[1:], b[1:])

    # (4) Otherwise, `1` plus the minimum of the following ...
    return 1 + min(
        levenshtein_distance(a[1:], b),  # Delete first char of `a`
        levenshtein_distance(a, b[1:]),  # Delete first char of `b`
        levenshtein_distance(a[1:], b[1:]),  # Replace first character

In case it is not yet clear, Levenshtein distance is counting the number of characters that have to be deleted/inserted/replaced to transform the string a into the string b.

When you implement an algorithm like this, it's always a good idea to first write a few test-cases to ensure that your implementation is actually correct. For the sake of completeness, here are the test-cases that I have created for myself:

assert levenshtein_distance("", "") == 0              # equal
assert levenshtein_distance("abcxyz", "abcxyz") == 0  # equal

assert levenshtein_distance("", "abcxyz") == 6        # nothing
assert levenshtein_distance("abcxyz", "") == 6        # nothing

assert levenshtein_distance("a", "abcxyz") == 5       # beginning
assert levenshtein_distance("ab", "abcxyz") == 4      # beginning

assert levenshtein_distance("cx", "abcxyz") == 4      # middle
assert levenshtein_distance("bcxy", "abcxyz") == 2    # middle

assert levenshtein_distance("yz", "abcxyz") == 4      # ending
assert levenshtein_distance("z", "abcxyz") == 5       # ending

assert levenshtein_distance("az", "abcxyz") == 4      # fractured
assert levenshtein_distance("acxz", "abcxyz") == 2    # fractured
assert levenshtein_distance("bcy", "abcxyz") == 3     # fractured

assert levenshtein_distance("123", "abcxyz") == 6     # different chars
assert levenshtein_distance("ab123", "abcxyz") == 4   # different chars
assert levenshtein_distance("1a2c3z", "abcxyz") == 4  # different chars

As you can see, I decided against using a testing framework like pytest, which would have been overkill, and instead used simple assert statements. Anyway, thank you for reading, and see you soon!

Ready to Learn More Web Development?

Join my Mailing List to receive two articles per week.

I send two weekly emails on building performant and resilient Web Applications with Python, JavaScript and PostgreSQL. No spam. Unscubscribe at any time.

Continue Reading?

Here are a few more Articles for you ...

Word Level Levenshtein Distance

This article explains how to use the Levenshtein algorithm to compare sentences by looking for word differences instead of character differences.

By Christoph Schiessl on Python

Orphaned Branches in Git

Learn about Git's internal data structure and how orphaned branches can be used to create separate histories with their own root commits.

By Christoph Schiessl on DevOps and Git

Generating Random Numbers According to a Probability Distribution

Learn how to generate random numbers in PostgreSQL whose distribution follows the uniform, exponential, or normal probability distribution.

By Christoph Schiessl on PostgreSQL

Christoph Schiessl

Christoph Schiessl

Independent Consultant + Full Stack Developer

If you hire me, you can rely on more than a decade of experience, which I have collected working on web applications for many clients across multiple industries. My involvement usually focuses on hands-on development work using various technologies like Python, JavaScript, PostgreSQL, or whichever technology we determine to be the best tool for the job. Furthermore, you can also depend on me in an advisory capacity to make educated technological choices for your backend and frontend teams. Lastly, I can help you transition to or improve your agile development processes.