Calculating Levenshtein Distance on the Word Level
by Christoph Schiessl on Python
Some time ago, I published an article explaining the Levenshtein algorithm to calculate the distance between a pair of strings, which measures the strings' similarity by counting the number of characters that would need to be deleted/inserted/replaced to transform one string into the other.
Anyway, in the previous article, I used Levenshtein at the character level, and it didn't occur to me that it could also be used at a "higher" level. So, instead of comparing individual words by looking for character differences, you can use the same algorithm to compare sentences by looking for word differences.
To do this, we first have to extract all words from our sentences because lists of words are the input we want to work with. The best solution I could think of is re.findall()
with \w+
as the regular expression (i.e., all non-empty sequences of letters).
Python 3.12.2 (main, Mar 29 2024, 14:30:28) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> sentence1 = "This article explains word level Levenshtein distance."
>>> sentence2 = "This article explains character level Levenshtein distance."
>>> print(words1 := re.findall(r"\w+", sentence1))
['This', 'article', 'explains', 'word', 'level', 'Levenshtein', 'distance']
>>> print(words2 := re.findall(r"\w+", sentence2))
['This', 'article', 'explains', 'character', 'level', 'Levenshtein', 'distance']
We also need to slightly update the levenshtein_distance()
function that I wrote in my previous article. As I said, we can leave the algorithm as is because it's already sufficient to work with characters, words, or whatever else you may want to throw it at. However, we need to generalize the type annotations of the inputs to work with str
s and list
s. Luckily, there is the abstract collections.abc.Sequence
class, defining operations supported by all sequences, and we can use this to write type hints compatible with both types.
from collections.abc import Sequence
def levenshtein_distance(a: Sequence, b: Sequence) -> int:
# (1) Early exit if `a` is blank ...
if len(a) == 0:
return len(b)
# (2) Early exit if `b` is blank ...
if len(b) == 0:
return len(a)
# (3) Ignore the common first character of `a` and `b` ...
if a[0] == b[0]:
return levenshtein_distance(a[1:], b[1:])
# (4) Otherwise, `1` plus the minimum of the following ...
return 1 + min(
levenshtein_distance(a[1:], b), # Delete first char of `a`
levenshtein_distance(a, b[1:]), # Delete first char of `b`
levenshtein_distance(a[1:], b[1:]), # Replace first character
)
Anyway, I have saved this function to a file named levenshtein.py
so that I can import it into the Python console session that I started before.
>>> from levenshtein import *
>>> levenshtein_distance(words1, words2)
1
Voilà! This is the expected output because the two sentences had only a single difference — the word "character" was replaced with "word".
That's everything for today. Thank you for reading, and see you soon!