1. Overview

Working with strings is a fundamental task in Java programming, and at times, we need to split a string into multiple substrings for further processing. Whether it’s parsing user input or processing data files, knowing how to break strings effectively is essential.

In this tutorial, we’ll explore different approaches and techniques for breaking an input string into a string array or list containing digit and non-digit string elements in the original order.

2. Introduction to the Problem

As usual, let’s understand the problem through examples.

Let’s say we have two input strings:

String INPUT1 = "01Michael Jackson23Michael Jordan42Michael Bolton999Michael Johnson000";
String INPUT2 = "Michael Jackson01Michael Jordan23Michael Bolton42Michael Johnson999Great Michaels";

As the examples above show, both strings consist of consecutive digit and non-digit characters. For example, consecutive digit substrings in INPUT1 are “01”, “23”, “42”, “999”, and “000”. The non-digit substrings are “Michael Jackson”, “Michael Jordan”, “Michael Bolton”, and so on.

INPUT2 is similar. The difference is it starts with a non-digit string. Therefore, we can conclude a few input characteristics:

  • The length of digit or non-digit substrings is dynamic.
  • The input string can start with a digit or non-digit substring.

We aim to break the input string into an array or list of these string elements:

String[] EXPECTED1 = new String[] { "01", "Michael Jackson", "23", "Michael Jordan", "42", "Michael Bolton", "999", "Michael Johnson", "000" };
List<String> EXPECTED_LIST1 = Arrays.asList(EXPECTED1);

String[] EXPECTED2 = new String[] { "Michael Jackson", "01", "Michael Jordan", "23", "Michael Bolton", "42", "Michael Johnson", "999", "Great Michaels" };
List<String> EXPECTED_LIST2 = Arrays.asList(EXPECTED2);

In this tutorial, we’ll solve this problem using both regex-based and non-regex-based approaches. Further, we’ll compare their performance at the end.

For simplicity, we’ll use unit test assertions to verify whether each approach works as expected.

3. Using the String.split() Method

First, let’s solve this problem using a regex-based approach. We know that the String.split() method is a handy tool for splitting a String into an array. For example, "a, b, c, d".split(", ") returns a string array: {"a", "b", "c", "d"}.

So, using the split() method could be the first idea we come up with to solve our problem. Then, we need to find a regex pattern that works as the separator and guides split() to the expected result. However, we may notice one difficulty when we think about it more carefully.

Let’s revisit the "a, b, c, d".split(", ") example. We used ", " as the separator regex pattern and got the string elements in the array result: "a", "b", "c", and "d". If we look at the result string elements, we’ll see that none of the matched separators (", ") appear in the result string array.

However, if we look at the inputs and expected outputs of our problem, every character in the input appears in the result array or list. Therefore, if we want to use split() to solve the problem, the separator pattern must match a position without consuming any character. In other words, we need zero-length assertions, such as the lookaround (lookahead and lookbehind) assertions. Next, let’s analyze our input string:

01[!]Michael Jackson[!]23[!]Michael Jordan[!]42[!]Michael Bolton...

To make it clear, we marked the desired separators using ‘[!]’ in the input above. Each separator sits either between a \d (digit character) and a \D (non-digit character) or between a \D and a \d. If we translate this into a lookaround regex pattern, it’s (?<=\D)(?=\d)|(?<=\d)(?=\D).
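To see why the zero-length pattern matters, the following minimal sketch contrasts a separator that consumes digits with the lookaround pattern above. The class name, method names, and the "ab12cd" sample are made up for illustration and aren't part of the tutorial's code:

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // The zero-length pattern from the text: it matches the empty position
    // between a non-digit and a digit, or between a digit and a non-digit.
    static final String BOUNDARY = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)";

    // A consuming separator: the matched digits are dropped from the result.
    static String[] splitConsuming(String input) {
        return input.split("\\d+");
    }

    // A zero-length separator: nothing is consumed, so every character survives.
    static String[] splitAtBoundaries(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String sample = "ab12cd"; // hypothetical sample input
        System.out.println(Arrays.toString(splitConsuming(sample)));    // [ab, cd]
        System.out.println(Arrays.toString(splitAtBoundaries(sample))); // [ab, 12, cd]
    }
}
```

With the consuming pattern, "12" disappears from the output, whereas the lookaround pattern keeps it as its own element.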

Next, let’s write a test to verify if using split(), with this pattern, on the two inputs produces the desired results:

String splitRE = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)";
String[] result1 = INPUT1.split(splitRE);
assertArrayEquals(EXPECTED1, result1);

String[] result2 = INPUT2.split(splitRE);
assertArrayEquals(EXPECTED2, result2);

The test passes if we give it a run. So, we’ve solved the problem using the split() method.
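Since our goal mentions a list as well as an array, here is one possible way to get a List<String> from the split() result. This is only a sketch: the class and method names are assumptions, and Arrays.asList() returns a fixed-size list backed by the array, so we'd need to copy it into an ArrayList if we want to add or remove elements later:

```java
import java.util.Arrays;
import java.util.List;

class SplitToListDemo {
    // Hypothetical helper: wraps the array returned by split() in a fixed-size list.
    static List<String> splitToList(String input) {
        return Arrays.asList(input.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)"));
    }

    public static void main(String[] args) {
        // Shortened sample in the style of INPUT2
        System.out.println(splitToList("Michael Jackson01Michael Jordan23"));
        // [Michael Jackson, 01, Michael Jordan, 23]
    }
}
```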

Next, let’s solve the problem using a non-regex approach.

4. A Non-Regex-Based Approach

We’ve seen how to solve the problem using the regex-based split() approach. Alternatively, we can solve it without using pattern matching.

The idea is to walk through the input string character by character, from the beginning, and cut a new element whenever the character type switches between digit and non-digit. Let’s first look at the implementation and then understand how it works:

enum State {
    INIT, PARSING_DIGIT, PARSING_NON_DIGIT
}

List<String> parseString(String input) {
    List<String> result = new ArrayList<>();
    int start = 0; // start index of the element currently being parsed
    State state = State.INIT;
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c >= '0' && c <= '9') {
            if (state == State.PARSING_NON_DIGIT) { // non-digit to digit, get the substring as an element
                result.add(input.substring(start, i));
                start = i;
            }
            state = State.PARSING_DIGIT;
        } else {
            if (state == State.PARSING_DIGIT) { // digit to non-digit, get the substring as an element
                result.add(input.substring(start, i));
                start = i;
            }
            state = State.PARSING_NON_DIGIT;
        }
    }
    result.add(input.substring(start)); // add the last part
    return result;
}

Now, let’s walk through the code above quickly and understand how it works:

  • First, we initialize an empty ArrayList called result to store the extracted elements.
  • The start variable, initialized to 0, keeps track of the start index of the current substring during the iteration.
  • The state variable is an enum, which indicates the state while iterating through the string.
  • Then, we use a for loop to iterate through the input string characters and check each character’s type.
  • If the current character is a digit (0-9) and the previous state is PARSING_NON_DIGIT, we’ve hit a non-digit to digit transition, which means an element has ended. So, we add the substring from start (inclusive) to i (exclusive) to the result list and update the start index to the current index i. Whether or not a transition happened, we then set state to PARSING_DIGIT.
  • The else block follows a similar logic and handles the digit to non-digit transition scenario.
  • After the for loop ends, we shouldn’t forget to add the last part of the string to the result list by using input.substring(start).
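To make the walkthrough above concrete, here's a small sketch that records every cut the state machine makes. It uses a condensed but equivalent form of the transition check, and the trace parameter, the class name, and the "01ab2" sample are hypothetical additions for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

class ParseTraceDemo {
    enum State { INIT, PARSING_DIGIT, PARSING_NON_DIGIT }

    // Condensed variant of parseString(): the two branches are merged into one
    // "did the character type change?" check. The trace list is an illustrative extra.
    static List<String> parseWithTrace(String input, List<String> trace) {
        List<String> result = new ArrayList<>();
        int start = 0;
        State state = State.INIT;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            State next = (c >= '0' && c <= '9') ? State.PARSING_DIGIT : State.PARSING_NON_DIGIT;
            if (state != State.INIT && state != next) {
                // A transition ends the current element: cut [start, i)
                trace.add("i=" + i + ": " + state + " -> " + next + ", cut [" + start + ", " + i + ")");
                result.add(input.substring(start, i));
                start = i;
            }
            state = next;
        }
        result.add(input.substring(start)); // the last element has no closing transition
        return result;
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        List<String> result = parseWithTrace("01ab2", trace);
        trace.forEach(System.out::println);
        // i=2: PARSING_DIGIT -> PARSING_NON_DIGIT, cut [0, 2)
        // i=4: PARSING_NON_DIGIT -> PARSING_DIGIT, cut [2, 4)
        System.out.println(result); // [01, ab, 2]
    }
}
```

The trace shows that elements are only added at transitions, which is why the final input.substring(start) call is needed for the last element.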

Next, let’s test the parseString() method with our two inputs:

List<String> result1 = parseString(INPUT1);
assertEquals(EXPECTED_LIST1, result1);

List<String> result2 = parseString(INPUT2);
assertEquals(EXPECTED_LIST2, result2);

If we run the test, it passes. So, our parseString() method does the job.
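Beyond the two sample inputs, it may be worth checking a few edge cases. The sketch below compares parseString() with the split() approach on some inputs we made up (an empty string, digits only, non-digits only, and a two-character string). The class name and the splitToList() helper are assumptions for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EdgeCaseDemo {
    enum State { INIT, PARSING_DIGIT, PARSING_NON_DIGIT }

    // Copy of the tutorial's parseString(), with the enum constants qualified
    static List<String> parseString(String input) {
        List<String> result = new ArrayList<>();
        int start = 0;
        State state = State.INIT;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                if (state == State.PARSING_NON_DIGIT) {
                    result.add(input.substring(start, i));
                    start = i;
                }
                state = State.PARSING_DIGIT;
            } else {
                if (state == State.PARSING_DIGIT) {
                    result.add(input.substring(start, i));
                    start = i;
                }
                state = State.PARSING_NON_DIGIT;
            }
        }
        result.add(input.substring(start));
        return result;
    }

    // Hypothetical helper wrapping the split() approach
    static List<String> splitToList(String input) {
        return Arrays.asList(input.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)"));
    }

    public static void main(String[] args) {
        // Made-up edge-case inputs, not taken from the tutorial
        for (String input : new String[] { "", "123", "abc", "a1" }) {
            System.out.println("\"" + input + "\" -> " + parseString(input) + " / " + splitToList(input));
        }
    }
}
```

For these inputs, both approaches return the same elements. Note that an empty input yields a list holding a single empty string in both cases, so callers that expect an empty result need to handle that themselves.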

5. Performance

So far, we’ve seen two solutions to the problem: regex-based and non-regex-based. The regex-based split() solution is pretty straightforward, just a single method call. By contrast, our much longer self-made parseString() method requires handling every single character in the input on our own. So, some of us may ask: why would we introduce, or even use, the self-made method to solve the problem?

The answer is “performance.”

Although our parseString() solution looks lengthy and requires manual control of each character, it’s faster than the regex-based solution. Let’s understand the reasons for this:

  • The split() solution requires compiling the regex pattern and applying pattern matching. These operations are comparatively expensive, especially for patterns with lookarounds. The parseString() method, on the other hand, uses a simple enum-based state machine to track transitions between digit and non-digit characters. It relies on direct character comparisons and avoids the overhead of regex pattern matching.
  • In the parseString() method, substrings are extracted directly using the substring() method with indices we already know. The regex-based split() may allocate additional intermediate objects along the way, such as the compiled pattern and its matcher, which parseString() doesn’t need.

However, the difference in performance may be negligible if the input string isn’t considerably long.
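If we'd like to stay with the regex approach, one common mitigation for the compilation cost is to compile the pattern once and reuse it through Pattern.split(). The following is only a sketch under that assumption: the class and field names are made up, and this variant isn't part of the benchmark in this tutorial, so any gain would need to be measured:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PrecompiledSplitDemo {
    // Compiled once and reused; Pattern instances are immutable and safe to share.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");

    // Hypothetical helper: same result as input.split(regex), without recompiling per call
    static String[] splitPrecompiled(String input) {
        return BOUNDARY.split(input);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitPrecompiled("01Michael Jackson23")));
        // [01, Michael Jackson, 23]
    }
}
```

This removes only the repeated compilation step. The lookaround matching itself still runs on every call.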

Next, let’s benchmark the performance of these two approaches. We’ll use JMH (the Java Microbenchmark Harness) to do that. This is because JMH allows us to easily handle benchmarking factors, such as JVM warm-up, dead code elimination, and so on:

@State(Scope.Benchmark)
@Threads(1)
@BenchmarkMode(Mode.Throughput)
@Fork(warmups = 1, value = 1)
@Warmup(iterations = 2, time = 10, timeUnit = TimeUnit.MILLISECONDS)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class BenchmarkLiveTest {
    private static final String INPUT = "01Michael Jackson23Michael Jordan42Michael Bolton999Michael Johnson000";

    @Param({ "10000" })
    public int iterations;

    @Benchmark
    public void regexBased(Blackhole blackhole) {
        blackhole.consume(INPUT.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)"));
    }

    @Benchmark
    public void nonRegexBased(Blackhole blackhole) {
        blackhole.consume(parseString(INPUT));
    }

    @Test
    public void benchmark() throws Exception {
        String[] argv = {};
        org.openjdk.jmh.Main.main(argv);
    }
}

As the above class shows, we benchmark the two approaches using the same input. The iterations parameter shows up as a column in the report, although the benchmark methods shown here don’t read it. We won’t dive into JMH and explain each annotation’s meaning, but two annotations are important for understanding the final report: @OutputTimeUnit(TimeUnit.MILLISECONDS) and @BenchmarkMode(Mode.Throughput). This combination means we measure how many times we can run each approach per millisecond.

Next, let’s take a look at the result JMH generates:

Benchmark                        (iterations)   Mode  Cnt     Score     Error   Units
BenchmarkLiveTest.nonRegexBased         10000  thrpt    5  3880.989 ± 134.021  ops/ms
BenchmarkLiveTest.regexBased            10000  thrpt    5   297.282 ±  24.818  ops/ms

As we can see, the non-regex-based solution’s throughput is about 13 times that of the regex-based solution (3880.989 / 297.282 ≈ 13.05). Therefore, when we need to handle long strings in a performance-critical application, we should choose parseString() over the split() solution.
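To put the scores in perspective, the small sketch below recomputes the ratio from the report and converts each throughput (ops/ms) into an approximate time per operation. The class and method names are made up for this illustration, and the numbers are simply the scores from the table above, ignoring the error margins:

```java
import java.util.Locale;

class ThroughputMathDemo {
    // How many times higher one throughput is than another
    static double ratio(double fasterOpsPerMs, double slowerOpsPerMs) {
        return fasterOpsPerMs / slowerOpsPerMs;
    }

    // One millisecond is 1,000,000 nanoseconds, so ns/op = 1,000,000 / (ops/ms)
    static double nanosPerOp(double opsPerMs) {
        return 1_000_000.0 / opsPerMs;
    }

    public static void main(String[] args) {
        double nonRegex = 3880.989; // score from the JMH report
        double regex = 297.282;     // score from the JMH report
        System.out.println(String.format(Locale.ROOT, "ratio: %.2f", ratio(nonRegex, regex)));
        System.out.println(String.format(Locale.ROOT, "nonRegexBased: %.0f ns/op", nanosPerOp(nonRegex)));
        System.out.println(String.format(Locale.ROOT, "regexBased: %.0f ns/op", nanosPerOp(regex)));
    }
}
```

In other words, on this input, one parseString() call takes roughly 258 nanoseconds, while one split() call takes roughly 3,364 nanoseconds.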

6. Conclusion

In this article, we’ve explored regex-based (split()) and non-regex-based (parseString()) approaches to breaking an input string into a string array or list containing digit elements and non-digit string elements in the original order.

The split() solution is compact and straightforward. However, when dealing with long input strings, it can be significantly slower than the self-made parseString() solution.

As usual, all code snippets presented in the article are available over on GitHub.