1. Introduction

In this article, we describe the Levenshtein distance, alternatively known as the Edit distance. The algorithm explained here was devised by a Russian scientist, Vladimir Levenshtein, in 1965.

We’ll provide an iterative and a recursive Java implementation of this algorithm.

2. What Is the Levenshtein Distance?

The Levenshtein distance is a measure of dissimilarity between two Strings. Mathematically, given two Strings x and y, the distance measures the minimum number of character edits required to transform x into y.

Typically three type of edits are allowed:

  1. Insertion of a character c
  2. Deletion of a character c
  3. Substitution of a character c with c

Example: If x = ‘shot’ and y = ‘spot’, the edit distance between the two is 1 because ‘shot’ can be converted to ‘spot’ by substituting ‘h‘ to ‘p‘.

In certain sub-classes of the problem, the cost associated with each type of edit may be different.

For example, less cost for substitution with a character located nearby on the keyboard and more cost otherwise. For simplicity, we’ll consider all costs to be equal in this article.

Some of the applications of edit distance are:

  1. Spell Checkers – detecting spelling errors in text and find the closest correct spelling in dictionary
  2. Plagiarism Detection (refer – IEEE Paper)
  3. DNA Analysis – finding similarity between two sequences
  4. Speech Recognition (refer – Microsoft Research)

3. Algorithm Formulation

Let’s take two Strings x and y of lengths m and n respectively. We can denote each String as x[1:m] and y[1:n].

We know that at the end of the transformation, both Strings will be of equal length and have matching characters at each position. So, if we consider the first character of each String, we’ve got three options:

  1. Substitution:
    1. Determine the cost (D1) of substituting x[1] with y[1]. The cost of this step would be zero if both characters are same. If not, then the cost would be one
    2. After step 1.1, we know that both Strings start with the same character. Hence the total cost would now be the sum of the cost of step 1.1 and the cost of transforming the rest of the String x[2:m] into y[2:n]
  2. Insertion:
    1. Insert a character in x to match the first character in y, the cost of this step would be one
    2. After 2.1, we have processed one character from y. Hence the total cost would now be the sum of the cost of step 2.1 (i.e., 1) and the cost of transforming the full x[1:m] to remaining y (y[2:n])
  3. Deletion:
    1. Delete the first character from x, the cost of this step would be one
    2. After 3.1, we have processed one character from x, but the full y remains to be processed. The total cost would be the sum of the cost of 3.1 (i.e., 1) and the cost of transforming remaining x to the full y

The next part of the solution is to figure out which option to choose out of these three. Since we do not know which option would lead to minimum cost at the end, we must try all options and choose the best one.

4. Naive Recursive Implementation

We can see that the second step of each option in section #3 is mostly the same edit distance problem but on sub-strings of the original Strings. This means after each iteration we end up with the same problem but with smaller Strings.

This observation is the key to formulate a recursive algorithm. The recurrence relation can be defined as:

D(x[1:m], y[1:n]) = min {

D(x[2:m], y[2:n]) + Cost of Replacing x[1] to y[1],

D(x[1:m], y[2:n]) + 1,

D(x[2:m], y[1:n]) + 1

}

We must also define base cases for our recursive algorithm, which in our case is when one or both Strings become empty:

  1. When both Strings are empty, then the distance between them is zero
  2. When one of the Strings is empty, then the edit distance between them is the length of the other String, as we need that many numbers of insertions/deletions to transform one into the other:
    • Example: if one String is “dog” and other String is “” (empty), we need either three insertions in empty String to make it “dog”, or we need three deletions in “dog” to make it empty. Hence the edit distance between them is 3

A naive recursive implementation of this algorithm:

public class EditDistanceRecursive {

   static int calculate(String x, String y) {
        if (x.isEmpty()) {
            return y.length();
        }

        if (y.isEmpty()) {
            return x.length();
        } 

        int substitution = calculate(x.substring(1), y.substring(1)) 
         + costOfSubstitution(x.charAt(0), y.charAt(0));
        int insertion = calculate(x, y.substring(1)) + 1;
        int deletion = calculate(x.substring(1), y) + 1;

        return min(substitution, insertion, deletion);
    }

    public static int costOfSubstitution(char a, char b) {
        return a == b ? 0 : 1;
    }

    public static int min(int... numbers) {
        return Arrays.stream(numbers)
          .min().orElse(Integer.MAX_VALUE);
    }
}

This algorithm has the exponential complexity. At each step, we branch-off into three recursive calls, building an O(3^n) complexity.

In the next section, we’ll see how to improve upon this.

5. Dynamic Programming Approach

On analyzing the recursive calls, we observe that the arguments for sub-problems are suffixes of the original Strings. This means there can only be m*n unique recursive calls (where m and n are a number of suffixes of x and y). Hence the complexity of the optimal solution should be quadratic, O(m*n).

Lets look at some of the sub-problems (according to recurrence relation defined in section #4):

  1. Sub-problems of D(x[1:m], y[1:n]) are D(x[2:m], y[2:n]), D(x[1:m], y[2:n]) and D(x[2:m], y[1:n])
  2. Sub-problems of D(x[1:m], y[2:n]) are D(x[2:m], y[3:n]), D(x[1:m], y[3:n]) and D(x[2:m], y[2:n])
  3. Sub-problems of D(x[2:m], y[1:n]) are D(x[3:m], y[2:n]), D(x[2:m], y[2:n]) and D(x[3:m], y[1:n])

In all three cases, one of the sub-problems is D(x[2:m], y[2:n]). Instead of calculating this three times like we do in the naive implementation, we can calculate this once and reuse the result whenever needed again.

This problem has a lot of overlapping sub-problems, but if we know the solution to the sub-problems, we can easily find the answer to the original problem. Therefore, we have both of the properties needed for formulating a dynamic programming solution, i.e., Overlapping Sub-Problems and Optimal Substructure.

We can optimize the naive implementation by introducing memoization, i.e., store the result of the sub-problems in an array and reuse the cached results.

Alternatively, we can also implement this iteratively by using a table based approach:

static int calculate(String x, String y) {
    int[][] dp = new int[x.length() + 1][y.length() + 1];

    for (int i = 0; i <= x.length(); i++) {
        for (int j = 0; j <= y.length(); j++) {
            if (i == 0) {
                dp[i][j] = j;
            }
            else if (j == 0) {
                dp[i][j] = i;
            }
            else {
                dp[i][j] = min(dp[i - 1][j - 1] 
                 + costOfSubstitution(x.charAt(i - 1), y.charAt(j - 1)), 
                  dp[i - 1][j] + 1, 
                  dp[i][j - 1] + 1);
            }
        }
    }

    return dp[x.length()][y.length()];
}

This algorithm performs significantly better than the recursive implementation. However, it involves significant memory consumption.

This can further be optimized by observing that we only need the value of three adjacent cells in the table to find the value of the current cell.

6. Conclusion

In this article, we described what is Levenshtein distance and how it can be calculated using a recursive and a dynamic-programming based approach.

Levenshtein distance is only one of the measures of string similarity, some of the other metrics are Cosine Similarity (which uses a token-based approach and considers the strings as vectors), Dice Coefficient, etc.

As always the full implementation of examples can be found over on GitHub.


« 上一篇: Jackson嵌套字段映射
» 下一篇: Micrometer快速指南