1. Overview

This quick tutorial will show how to find the difference between two strings using Java.

For this tutorial, we’re going to use two existing Java libraries and compare their approaches to this problem.

2. The Problem

Let’s consider the following requirement: we want to find the difference between the strings ABCDELMN” and “ABCFGLMN”.

Depending on what format we need the output to be, and ignoring the possibility to write our custom code to do so, we found two main options available.

The first one is a library written by Google called diff-match-patch. As they claim, the library offers robust algorithms for synchronizing plain text.

The other option is the StringUtils class from Apache Commons Lang.

Let’s explore the differences between these two.

3. diff-match-patch

For the purpose of this article, we will use a fork of the original Google library, as the artifacts for the original one are not released on Maven Central. Also, some class names are different from the original codebase and are more adherent to the Java standards.

First, we’ll need to include its dependency in our pom.xml file:

<dependency>
    <groupId>org.bitbucket.cowwoc</groupId>
    <artifactId>diff-match-patch</artifactId>
    <version>1.2</version>
</dependency>

Then, let’s consider this code:

String text1 = "ABCDELMN";
String text2 = "ABCFGLMN";
DiffMatchPatch dmp = new DiffMatchPatch();
LinkedList<Diff> diff = dmp.diffMain(text1, text2, false);

If we run the above code – which produces the difference between text1 and text2 – printing the variable diff will produce this output:

[Diff(EQUAL,"ABC"), Diff(DELETE,"DE"), Diff(INSERT,"FG"), Diff(EQUAL,"LMN")]

In fact, the output will be a list of Diff objects, each one being formed by an operation type (INSERT, DELETE or EQUAL), and the portion of text associated with the operation.

When running the diff between text2 and text1, we will get this result:

[Diff(EQUAL,"ABC"), Diff(DELETE,"FG"), Diff(INSERT,"DE"), Diff(EQUAL,"LMN")]

4. StringUtils

The class from Apache Commons has a more simplistic approach.

First, we’ll add the Apache Commons Lang dependency to our pom.xml file:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.14.0</version>
</dependency>

Then, to find the difference between two texts with Apache Commons, we’d call StringUtils#Difference:

StringUtils.difference(text1, text2)

The output produced will be a simple string:

FGLMN

Whereas running the diff between text2 and text1 will return:

DELMN

This simple approach can be enhanced using StringUtils.indexOfDifference(), which will return the index at which the two strings start to differ (in our case, the fourth character of the string). This index can be used to get a substring of the original string, to show what is common between the two inputs, in addition to what’s different.

5. Performance

For our benchmarks, we generate a list of 10,000 strings with a fixed portion of 10 characters, followed by 20 random alphabetic characters.

We then loop through the list and perform a diff between the nth element and the n+1th element of the list:

@Benchmark
public int diffMatchPatch() {
    for (int i = 0; i < inputs.size() - 1; i++) {
        diffMatchPatch.diffMain(inputs.get(i), inputs.get(i + 1), false);
    }
    return inputs.size();
}
@Benchmark
public int stringUtils() {
    for (int i = 0; i < inputs.size() - 1; i++) {
        StringUtils.difference(inputs.get(i), inputs.get(i + 1));
    }
    return inputs.size();
}

Finally, let’s run the benchmarks and compare the two libraries:

Benchmark                                   Mode  Cnt    Score   Error  Units
StringDiffBenchmarkUnitTest.diffMatchPatch  avgt   50  130.559 ± 1.501  ms/op
StringDiffBenchmarkUnitTest.stringUtils     avgt   50    0.211 ± 0.003  ms/op

6. Conclusion

In terms of pure execution speed, StringUtils is clearly more performant, although it only returns the substring from which the two strings start to differ.

At the same time, Diff-Match-Patch provides a more thorough comparison result, at the expense of performance.

The implementation of these examples and snippets is available over on GitHub.


» 下一篇: Java中的逻辑回归