Calculating Levenshtein Distance on the Word Level

Some time ago, I published an article explaining the Levenshtein algorithm to calculate the distance between a pair of strings, which measures the strings' similarity by counting the number of characters that would need to be deleted/inserted/replaced to transform one string into the other.

Anyway, in the previous article, I used Levenshtein at the character level, and it didn't occur to me that it could also be used at a "higher" level. So, instead of comparing individual words by looking for character differences, you can use the same algorithm to compare sentences by looking for word differences.

To do this, we first have to extract all words from our sentences because lists of words are the input we want to work with. The best solution I could think of is re.findall() with \w+ as the regular expression (i.e., all non-empty sequences of letters).

Python 3.12.2 (main, Mar 29 2024, 14:30:28) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> sentence1 = "This article explains word level Levenshtein distance."
>>> sentence2 = "This article explains character level Levenshtein distance."
>>> print(words1 := re.findall(r"\w+", sentence1))
['This', 'article', 'explains', 'word', 'level', 'Levenshtein', 'distance']
>>> print(words2 := re.findall(r"\w+", sentence2))
['This', 'article', 'explains', 'character', 'level', 'Levenshtein', 'distance']

We also need to slightly update the levenshtein_distance() function that I wrote in my previous article. As I said, we can leave the algorithm as is because it's already sufficient to work with characters, words, or whatever else you may want to throw it at. However, we need to generalize the type annotations of the inputs to work with strs and lists. Luckily, there is the abstract collections.abc.Sequence class, defining operations supported by all sequences, and we can use this to write type hints compatible with both types.

from collections.abc import Sequence


def levenshtein_distance(a: Sequence, b: Sequence) -> int:
    # (1) Early exit if `a` is blank ...
    if len(a) == 0:
        return len(b)

    # (2) Early exit if `b` is blank ...
    if len(b) == 0:
        return len(a)

    # (3) Ignore the common first character of `a` and `b` ...
    if a[0] == b[0]:
        return levenshtein_distance(a[1:], b[1:])

    # (4) Otherwise, `1` plus the minimum of the following ...
    return 1 + min(
        levenshtein_distance(a[1:], b),  # Delete first char of `a`
        levenshtein_distance(a, b[1:]),  # Delete first char of `b`
        levenshtein_distance(a[1:], b[1:]),  # Replace first character
    )

Anyway, I have saved this function to a file named levenshtein.py so that I can import it into the Python console session that I started before.

>>> from levenshtein import *
>>> levenshtein_distance(words1, words2)
1

Voilà! This is the expected output because the two sentences had only a single difference — the word "character" was replaced with "word".

That's everything for today. Thank you for reading, and see you soon!

Calculating Levenshtein Distance on the Word Level

Hi, I'm Christoph Schiessl.

I help you build robust and fast Web Applications.

Continue Reading?

Here are a few more Articles for you ...

Ready to Build Your Next Web App?

Get my Web App Reverse Checklist first ...