Programmers often come across algorithms involving splitting strings. In a special scenario, there might be a requirement to split a string based on single or multiple distinct delimiters and also return the delimiters as part of the split operation.
Let's discuss in detail the different available solutions to this String split problem.
The Java universe offers quite a few libraries (java.lang.String, Guava, and Apache Commons, to name a few) to facilitate the splitting of strings in simple and fairly complex cases. Additionally, the feature-rich regular expressions provide extra flexibility in splitting problems that revolve around the matching of a specific pattern.
In regular expressions, look-around assertions check whether another pattern occurs just ahead of (lookahead) or just behind (lookbehind) the current position in the source string, without consuming any characters. Let's understand this better with an example.
A lookahead assertion Java(?=Baeldung) matches “Java” only if it is followed by “Baeldung”.
Likewise, a negative lookbehind assertion (?<!#)\d+ matches a number only if it is not preceded by “#”.
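To see both assertions at work, here is a minimal sketch using java.util.regex. The class name LookAroundDemo, the helper findAll(), and the sample inputs are assumptions made for illustration only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAroundDemo {

    // Hypothetical helper: collects every substring of input matched by regex.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Lookahead: only the "Java" directly followed by "Baeldung" matches,
        // and "Baeldung" itself is not part of the match.
        System.out.println(findAll("Java(?=Baeldung)", "JavaBaeldung JavaScript")); // prints [Java]

        // Negative lookbehind: "7" is rejected because "#" comes right before it.
        System.out.println(findAll("(?<!#)\\d+", "#7 and 15")); // prints [15]
    }
}
```

One caveat worth keeping in mind: for a multi-digit input such as "#42", this pattern still matches the trailing "2", because only the first digit is directly preceded by “#”.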
Let's use such look-around assertion regular expressions and devise a solution to our problem.
In all of the examples explained in this article, we're going to use two simple Strings:
String text = "Hello@World@This@Is@A@Java@Program";
String textMixed = "@HelloWorld@This:Is@A#Java#Program";
Let's begin by using the split() method from the String class of the core Java library.
Moreover, we'll evaluate appropriate lookahead assertions, lookbehind assertions, and combinations of them to split the strings as desired.
First of all, let's use the lookahead assertion “((?=@))” and split the string text around its matches:
String[] splits = text.split("((?=@))");
The lookahead regex splits the string by a forward match of the “@” symbol. The content of the resulting array is:
[Hello, @World, @This, @Is, @A, @Java, @Program]
Using this regex doesn't return the delimiters separately in the splits array. Let's try an alternate approach.
We can also use a positive lookbehind assertion “((?<=@))” to split the string text:
String[] splits = text.split("((?<=@))");
However, the resulting output still won't contain the delimiters as individual elements of the array:
[Hello@, World@, This@, Is@, A@, Java@, Program]
We can combine the two look-arounds above with a logical-or and see the result in action.
The resulting regex “((?=@)|(?<=@))” splits both before and after every “@”, so each delimiter ends up as an element of its own. The below code snippet demonstrates this:
String[] splits = text.split("((?=@)|(?<=@))");
The above regular expression splits the string, and the resulting array contains the delimiters:
[Hello, @, World, @, This, @, Is, @, A, @, Java, @, Program]
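Because look-around assertions are zero-width, the split doesn't drop any characters. A quick way to convince ourselves is to join the pieces back together, as in this small sketch (the class name RoundTripCheck and the helper splitKeepingAt() are hypothetical names chosen for this example):

```java
import java.util.Arrays;

class RoundTripCheck {

    // Splits before and after every "@" so that each "@" becomes its own element.
    static String[] splitKeepingAt(String input) {
        return input.split("((?=@)|(?<=@))");
    }

    public static void main(String[] args) {
        String text = "Hello@World@This@Is@A@Java@Program";
        String[] splits = splitKeepingAt(text);

        System.out.println(Arrays.toString(splits));
        // prints [Hello, @, World, @, This, @, Is, @, A, @, Java, @, Program]

        // Nothing was consumed by the regex, so joining restores the input.
        System.out.println(String.join("", splits).equals(text)); // prints true
    }
}
```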
Now that we understand the required look-around assertion regular expression, we can modify it based on the different types of delimiters present in the input string.
Let's attempt to split the textMixed as defined previously using a suitable regex:
String[] splitsMixed = textMixed.split("((?=:|#|@)|(?<=:|#|@))");
Executing the above line of code produces the following array (on Java 8 and later, where a zero-width match at the start of the input doesn't produce a leading empty string):
[@, HelloWorld, @, This, :, Is, @, A, #, Java, #, Program]
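If the set of delimiters varies, the regex can be assembled programmatically. The following is only a sketch under a few assumptions: the class DelimiterKeepingSplit and its two helpers are hypothetical names, the delimiters are short literal strings, and Java 8 or later is used:

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class DelimiterKeepingSplit {

    // Builds "((?=d1|d2|...)|(?<=d1|d2|...))". Each delimiter is quoted so that
    // regex metacharacters such as "." or "|" are treated as literal text.
    static String lookAroundRegex(String... delimiters) {
        String alternation = Arrays.stream(delimiters)
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return "((?=" + alternation + ")|(?<=" + alternation + "))";
    }

    // Splits input around the given delimiters and keeps them in the result.
    static String[] splitKeepingDelimiters(String input, String... delimiters) {
        return input.split(lookAroundRegex(delimiters));
    }

    public static void main(String[] args) {
        String textMixed = "@HelloWorld@This:Is@A#Java#Program";
        String[] splitsMixed = splitKeepingDelimiters(textMixed, ":", "#", "@");
        System.out.println(Arrays.toString(splitsMixed));
        // prints [@, HelloWorld, @, This, :, Is, @, A, #, Java, #, Program]
    }
}
```

Quoting with Pattern.quote() is a design choice here: without it, a delimiter like "." would match any character and split the string far more often than intended.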
Considering that now we have clarity on the regex assertions discussed in the above section, let's delve into a Java library offered by Google.
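Guava's Splitter class offers the on() and onPattern() methods, which split a string using a regular expression as the separator. To start with, let's see them in action on the string text containing a single delimiter “@”: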
List<String> splits = Splitter.onPattern("((?=@)|(?<=@))").splitToList(text);
List<String> splits2 = Splitter.on(Pattern.compile("((?=@)|(?<=@))")).splitToList(text);
The results from executing the above lines of code contain the same elements as the ones generated by the split method, except we now have Lists instead of arrays.
Likewise, we can also use these methods to split a string containing multiple distinct delimiters:
List<String> splitsMixed = Splitter.onPattern("((?=:|#|@)|(?<=:|#|@))").splitToList(textMixed);
List<String> splitsMixed2 = Splitter.on(Pattern.compile("((?=:|#|@)|(?<=:|#|@))")).splitToList(textMixed);
The two methods differ only in the type of argument they take: the on() method accepts a java.util.regex.Pattern, whereas the onPattern() method accepts the separator regex as a String.
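The same String-versus-Pattern distinction exists in the JDK itself. As a rough JDK-only counterpart (this is not Guava code, and the class PrecompiledSplit and its method splitToList() are hypothetical names for this sketch), we can compile the regex once and reuse it:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledSplit {

    // Compiled once and reused for every call, similar in spirit to
    // handing a ready-made Pattern to a splitter.
    private static final Pattern AROUND_DELIMITERS =
        Pattern.compile("((?=:|#|@)|(?<=:|#|@))");

    // Returns the pieces, delimiters included, as a fixed-size List.
    static List<String> splitToList(String input) {
        return Arrays.asList(AROUND_DELIMITERS.split(input));
    }

    public static void main(String[] args) {
        System.out.println(splitToList("@HelloWorld@This:Is@A#Java#Program"));
        // prints [@, HelloWorld, @, This, :, Is, @, A, #, Java, #, Program]
    }
}
```

Precompiling mainly pays off when the same regex is applied to many inputs, since the pattern isn't parsed again on each call.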
We can also take advantage of the Apache Commons Lang project's StringUtils method splitByCharacterType().
It's really important to note that this method works by splitting the input string by the character type as returned by java.lang.Character.getType(char). Here, we don't get to pick or extract the delimiters of our choosing.
Furthermore, it delivers the best results when the source string has a constant case, either upper or lower, throughout:
String[] splits = StringUtils.splitByCharacterType("pg@no;10@hello;world@this;is@a#10words;Java#Program");
The different character types as seen in the above string are uppercase and lowercase letters, digits, and special characters (@, ;, and #).
Hence, the resulting array splits, as expected, looks like:
[pg, @, no, ;, 10, @, hello, ;, world, @, this, ;, is, @, a, #, 10, words, ;, J, ava, #, P, rogram]
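To make the grouping rule more tangible, here is a simplified JDK-only illustration that starts a new token whenever Character.getType() changes. It is a sketch of the idea, not the library's actual implementation, and the class CharacterTypeSplit and the method splitByType() are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

class CharacterTypeSplit {

    // Groups consecutive characters that share the same Character.getType value.
    static List<String> splitByType(String input) {
        List<String> tokens = new ArrayList<>();
        if (input.isEmpty()) {
            return tokens;
        }
        int tokenStart = 0;
        int currentType = Character.getType(input.charAt(0));
        for (int i = 1; i < input.length(); i++) {
            int type = Character.getType(input.charAt(i));
            if (type != currentType) {
                // The type changed, so the current token ends just before i.
                tokens.add(input.substring(tokenStart, i));
                tokenStart = i;
                currentType = type;
            }
        }
        tokens.add(input.substring(tokenStart));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitByType("a#10words;Java#Program"));
        // prints [a, #, 10, words, ;, J, ava, #, P, rogram]
    }
}
```

This also shows why mixed case hurts: "J" is an uppercase letter and "ava" is lowercase, so they land in separate tokens. Likewise, adjacent special characters of the same type, such as "@;", would stay together in a single token.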
In this article, we've seen how to split a string in such a way that the delimiters are also available in the resulting array.
First, we discussed look-around assertions and used them to get the desired results. Later, we used the methods provided by the Guava library to achieve similar results.
Finally, we wrapped up with the Apache Commons Lang library, which provides a convenient method for the related problem of splitting a string by character type, a split that also keeps the delimiters in the result.
As always, the code used in this article can be found over on GitHub.