Word Level Levenshtein Distance

by Christoph Schiessl on Python

Some time ago, I published an article explaining the Levenshtein algorithm to calculate the distance between a pair of strings, which measures the strings' similarity by counting the number of characters that would need to be deleted/inserted/replaced to transform one string into the other.

Anyway, in the previous article, I used Levenshtein at the character level, and it didn't occur to me that it could also be used at a "higher" level. So, instead of comparing individual words by looking for character differences, you can use the same algorithm to compare sentences by looking for word differences.

To do this, we first have to extract all words from our sentences because lists of words are the input we want to work with. The best solution I could think of is re.findall() with \w+ as the regular expression (i.e., all non-empty sequences of letters).

Python 3.12.2 (main, Mar 29 2024, 14:30:28) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> sentence1 = "This article explains word level Levenshtein distance."
>>> sentence2 = "This article explains character level Levenshtein distance."
>>> print(words1 := re.findall(r"\w+", sentence1))
['This', 'article', 'explains', 'word', 'level', 'Levenshtein', 'distance']
>>> print(words2 := re.findall(r"\w+", sentence2))
['This', 'article', 'explains', 'character', 'level', 'Levenshtein', 'distance']

We also need to slightly update the levenshtein_distance() function that I wrote in my previous article. As I said, we can leave the algorithm as is because it's already sufficient to work with characters, words, or whatever else you may want to throw it at. However, we need to generalize the type annotations of the inputs to work with strs and lists. Luckily, there is the abstract collections.abc.Sequence class, defining operations supported by all sequences, and we can use this to write type hints compatible with both types.

from collections.abc import Sequence

def levenshtein_distance(a: Sequence, b: Sequence) -> int:
    # (1) Early exit if `a` is blank ...
    if len(a) == 0:
        return len(b)

    # (2) Early exit if `b` is blank ...
    if len(b) == 0:
        return len(a)

    # (3) Ignore the common first character of `a` and `b` ...
    if a[0] == b[0]:
        return levenshtein_distance(a[1:], b[1:])

    # (4) Otherwise, `1` plus the minimum of the following ...
    return 1 + min(
        levenshtein_distance(a[1:], b),  # Delete first char of `a`
        levenshtein_distance(a, b[1:]),  # Delete first char of `b`
        levenshtein_distance(a[1:], b[1:]),  # Replace first character

Anyway, I have saved this function to a file named levenshtein.py so that I can import it into the Python console session that I started before.

>>> from levenshtein import *
>>> levenshtein_distance(words1, words2)

VoilĂ ! This is the expected output because the two sentences had only a single difference — the word "character" was replaced with "word".

That's everything for today. Thank you for reading, and see you soon!

Ready to Learn More Web Development?

Join my Mailing List to receive 1-2 useful Articles per week.

I send up to two weekly emails on building performant and resilient Web Apps with Python, JavaScript and PostgreSQL. No spam. Unscubscribe at any time.

Continue Reading?

Here are a few more Articles for you ...

Comparing Strings using Levenshtein Distance

Learn about the Levenshtein distance algorithm, a popular and easy-to-implement way to measure the similarity between two strings.

By Christoph Schiessl on Python

Orphaned Branches in Git

Learn about Git's internal data structure and how orphaned branches can be used to create separate histories with their own root commits.

By Christoph Schiessl on DevOps and Git

Caching Expensive Queries with MATERIALIZED VIEWs

Learn how to use PostgreSQL's MATERIALIZED VIEWs to improve performance of complex queries. Persist query results and refresh them manually or automatically.

By Christoph Schiessl on PostgreSQL

Christoph Schiessl

Christoph Schiessl

Independent Consultant + Full Stack Developer

If you hire me, you can rely on more than a decade of experience, which I have collected working on web applications for many clients across multiple industries. My involvement usually focuses on hands-on development work using various technologies like Python, JavaScript, PostgreSQL, or whichever technology we determine to be the best tool for the job. Furthermore, you can also depend on me in an advisory capacity to make educated technological choices for your backend and frontend teams. Lastly, I can help you transition to or improve your agile development processes.