Calculating Levenshtein Distance on the Word Level

by Christoph Schiessl on Python

Some time ago, I published an article explaining the Levenshtein algorithm to calculate the distance between a pair of strings, which measures the strings' similarity by counting the number of characters that would need to be deleted/inserted/replaced to transform one string into the other.

Anyway, in the previous article, I used Levenshtein at the character level, and it didn't occur to me that it could also be used at a "higher" level. So, instead of comparing individual words by looking for character differences, you can use the same algorithm to compare sentences by looking for word differences.

To do this, we first have to extract all words from our sentences because lists of words are the input we want to work with. The best solution I could think of is re.findall() with \w+ as the regular expression (i.e., all non-empty sequences of letters).

Python 3.12.2 (main, Mar 29 2024, 14:30:28) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> sentence1 = "This article explains word level Levenshtein distance."
>>> sentence2 = "This article explains character level Levenshtein distance."
>>> print(words1 := re.findall(r"\w+", sentence1))
['This', 'article', 'explains', 'word', 'level', 'Levenshtein', 'distance']
>>> print(words2 := re.findall(r"\w+", sentence2))
['This', 'article', 'explains', 'character', 'level', 'Levenshtein', 'distance']

We also need to slightly update the levenshtein_distance() function that I wrote in my previous article. As I said, we can leave the algorithm as is because it's already sufficient to work with characters, words, or whatever else you may want to throw it at. However, we need to generalize the type annotations of the inputs to work with strs and lists. Luckily, there is the abstract collections.abc.Sequence class, defining operations supported by all sequences, and we can use this to write type hints compatible with both types.

from collections.abc import Sequence


def levenshtein_distance(a: Sequence, b: Sequence) -> int:
    # (1) Early exit if `a` is blank ...
    if len(a) == 0:
        return len(b)

    # (2) Early exit if `b` is blank ...
    if len(b) == 0:
        return len(a)

    # (3) Ignore the common first character of `a` and `b` ...
    if a[0] == b[0]:
        return levenshtein_distance(a[1:], b[1:])

    # (4) Otherwise, `1` plus the minimum of the following ...
    return 1 + min(
        levenshtein_distance(a[1:], b),  # Delete first char of `a`
        levenshtein_distance(a, b[1:]),  # Delete first char of `b`
        levenshtein_distance(a[1:], b[1:]),  # Replace first character
    )

Anyway, I have saved this function to a file named levenshtein.py so that I can import it into the Python console session that I started before.

>>> from levenshtein import *
>>> levenshtein_distance(words1, words2)
1

Voilà! This is the expected output because the two sentences had only a single difference — the word "character" was replaced with "word".

That's everything for today. Thank you for reading, and see you soon!

Christoph Schiessl

Hi, I'm Christoph Schiessl.

I help you build robust and fast Web Applications.


I'm available for hire as a freelance web developer, so you can take advantage of my more than a decade of experience working on many projects across several industries. Most of my clients are building web-based SaaS applications in a B2B context and depend on my expertise in various capacities.

More often than not, my involvement includes hands-on development work using technologies like Python, JavaScript, and PostgreSQL. Furthermore, if you already have an established team, I can support you as a technical product manager with a passion for simplifying complex processes. Lastly, I'm an avid writer and educator who takes pride in breaking technical concepts down into the simplest possible terms.

Continue Reading?

Here are a few more Articles for you ...


Comparing Strings using Levenshtein Distance

Learn about the Levenshtein distance algorithm, a popular and easy-to-implement way to measure the similarity between two strings.

By Christoph Schiessl on Python

Telling Docker Who You Are

Learn how to avoid permission issues when creating files on a Docker bind-mount volume from within a container and manage user IDs and group IDs on Linux.

By Christoph Schiessl on DevOps and Docker

Restarting uvicorn Workers with the SIGHUP Signal

Learn about process management in Python's uvicorn web server and how to use signals to restart workers and to increment/decrement the number of workers.

By Christoph Schiessl on Python

Web App Reverse Checklist

Ready to Build Your Next Web App?

Get my Web App Reverse Checklist first ...


Software Engineering is often driven by fashion, but swimming with the current is rarely the best choice. In addition to knowing what to do, it's equally important to know what not to do. And this is precisely what my free Web App Reverse Checklist will help you with.

Subscribe below to get your free copy of my Reverse Checklist delivered to your inbox. Afterward, you can expect one weekly email on building resilient Web Applications using Python, JavaScript, and PostgreSQL.

By the way, it goes without saying that I'm not sharing your email address with anyone, and you're free to unsubscribe at any time. No spam. No commitments. No questions asked.