1. Overview
Extracting specific content from text is a common task in text processing. When data uses square brackets to enclose meaningful information, extracting the text between those brackets can be a challenge.
In this tutorial, we’ll explore the techniques and methods to extract content between square brackets.
2. Introduction to the Problem
First of all, for simplicity, let’s make two assumptions about the input:
- No nested square bracket pairs – For example, patterns like “..[value1 [value2]]..” won’t come as our input.
- Square brackets are always well-paired – For instance, “..[value1 …” is an invalid input.
When discussing input data enclosed within square brackets, we encounter two possible scenarios:
- Input with a single pair of square brackets, as seen in “..[value]..”
- Input with multiple pairs of square brackets, illustrated by “..[value1]..[value2]..[value3]…”
Moving forward, our focus will be on addressing the single-pair scenario first, and then we’ll proceed to adapt the solutions for cases involving multiple pairs. Throughout this tutorial, the primary technique we’ll use to solve these challenges will be Java regular expression (regex).
3. Input With a Single Pair of Square Brackets
Let’s say we’re given a text input:
String INPUT1 = "some text [THE IMPORTANT MESSAGE] something else";
As we can see, the input contains only one square bracket pair, and we aim to get the text in between:
String EXPECTED1 = "THE IMPORTANT MESSAGE";
So next, let’s see how to achieve that.
3.1. The [.*] Idea
A direct approach to this problem involves extracting content between the ‘[’ and ‘]’ characters. So, we may come up with the regex pattern “[.*]”.
However, we cannot use this pattern directly in our code, as regex uses ‘[’ and ‘]’ for character class definitions. For example, the “[0-9]” class matches any digit character. We must escape them to match a literal ‘[’ or ‘]’.
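To see why the unescaped pattern fails, here is a minimal sketch (the class name CharacterClassDemo and its helper are only illustrative, not part of the original examples). The engine reads “[.*]” as a character class matching a single ‘.’ or ‘*’ character, so it finds nothing in an input that contains neither:

```java
import java.util.regex.Pattern;

// Illustrative class name, not part of the original examples.
class CharacterClassDemo {

    // Returns true if the regex matches anywhere in the input.
    static boolean foundIn(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "some text [THE IMPORTANT MESSAGE] something else";

        // "[.*]" is a character class: one '.' or one '*'. The input has neither.
        System.out.println(foundIn("[.*]", input)); // false

        // Inside a class, '.' and '*' are plain characters.
        System.out.println("1.5*2".replaceAll("[.*]", "_")); // 1_5_2

        // Escaping the opening bracket makes both brackets literal.
        System.out.println(foundIn("\\[.*]", input)); // true
    }
}
```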
Furthermore, our task is extracting rather than matching. Therefore, we can put our target in a capturing group so that it’s easier to reference and extract later:
String result = null;
String rePattern = "\\[(.*)]";
Pattern p = Pattern.compile(rePattern);
Matcher m = p.matcher(INPUT1);
if (m.find()) {
    result = m.group(1);
}
assertThat(result).isEqualTo(EXPECTED1);
Sharp eyes may notice that we escaped only the opening ‘[’ in the code above. This works because the regex engine interprets a closing bracket or brace literally when it isn’t preceded by its corresponding opening character. In our example, we escaped the opening bracket as “\\[”, so the ‘]’ has no unescaped ‘[’ before it. Thus, ‘]’ is treated as a literal ‘]’ character.
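As a quick check of this behavior, the sketch below runs the same extraction with and without escaping the closing bracket. The class name ClosingBracketDemo and the extract() helper are assumptions made for illustration; both patterns should yield the same result:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name, not part of the original examples.
class ClosingBracketDemo {

    // Returns capturing group 1 of the first match, or null if nothing matches.
    static String extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "some text [THE IMPORTANT MESSAGE] something else";

        // Closing bracket left unescaped: taken literally.
        System.out.println(extract("\\[(.*)]", input));   // THE IMPORTANT MESSAGE

        // Closing bracket escaped explicitly: same result.
        System.out.println(extract("\\[(.*)\\]", input)); // THE IMPORTANT MESSAGE
    }
}
```

Escaping both brackets is slightly more verbose, but some readers may find it easier to scan.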
3.2. Using NOR Character Classes
We’ve solved the problem by extracting “everything” between ‘[’ and ‘]’. Here, “everything” consists of characters that aren’t ‘]’.
Regex supports NOR (negated) character classes. For instance, “[^0-9]” matches any non-digit character. Therefore, we can address this problem elegantly with a NOR class, resulting in the pattern “\\[([^]]*)”. The pattern doesn’t need a trailing ‘]’, since the class can’t match ‘]’ and the group stops right before the first one:
String result = null;
String rePattern = "\\[([^]]*)";
Pattern p = Pattern.compile(rePattern);
Matcher m = p.matcher(INPUT1);
if (m.find()) {
    result = m.group(1);
}
assertThat(result).isEqualTo(EXPECTED1);
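One practical difference between the two patterns is worth a small sketch. By default, ‘.’ doesn’t match line terminators, while a NOR class does. The class name NegatedClassDemo and the multi-line sample input below are assumptions made for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name, not part of the original examples.
class NegatedClassDemo {

    // Returns capturing group 1 of the first match, or null if nothing matches.
    static String extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input whose bracketed content spans two lines.
        String multiLine = "log [first line\nsecond line] end";

        // [^]] matches any character except ']', including '\n'.
        System.out.println(extract("\\[([^]]*)", multiLine));

        // '.' stops at the line break, so no ']' can follow and the match fails.
        System.out.println(extract("\\[(.*)]", multiLine)); // null
    }
}
```

If the dot-based pattern needs to span lines, compiling it with Pattern.DOTALL is one option.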
3.3. Using the split() Method
Java offers the powerful String.split() method to break the input string into pieces. split() supports the regex pattern as the delimiter. Next, let’s see if our problem can be solved by the split() method.
Consider the scenario of “prefix[value]suffix”. If we designate ‘[’ or ‘]’ as the delimiter, split() would yield an array: {“prefix”, “value”, “suffix”}. The next step is relatively straightforward. We can simply take the middle element from the array as the result:
String[] strArray = INPUT1.split("[\\[\\]]");
String result = strArray.length == 3 ? strArray[1] : null;
assertThat(result).isEqualTo(EXPECTED1);
In the code above, we check that the split result has exactly three elements before taking the second element from the array.
The test passes when we run it. However, this solution fails if the input ends with ‘]’:
String[] strArray = "[THE IMPORTANT MESSAGE]".split("[\\[\\]]");
assertThat(strArray).hasSize(2)
    .containsExactly("", "THE IMPORTANT MESSAGE");
As the test above shows, this time our input has neither a “prefix” nor a “suffix”, and the resulting array has only two elements. This is because, by default, split() discards trailing empty strings. To solve this, we can pass a negative limit to split(), which tells it to keep the empty string elements:
strArray = "[THE IMPORTANT MESSAGE]".split("[\\[\\]]", -1);
assertThat(strArray).hasSize(3)
    .containsExactly("", "THE IMPORTANT MESSAGE", "");
Therefore, we can change our solution to cover the corner case:
String[] strArray = INPUT1.split("[\\[\\]]", -1);
String result = strArray.length == 3 ? strArray[1] : null;
...
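Putting the pieces together, the split()-based approach can be sketched as a small helper. The class name SplitSinglePair and the extract() method are hypothetical names used only for this illustration:

```java
// Illustrative class name, not part of the original examples.
class SplitSinglePair {

    // Splits on '[' or ']' and keeps trailing empty strings (limit -1).
    // Returns the middle element when exactly one bracket pair is present,
    // otherwise null.
    static String extract(String input) {
        String[] parts = input.split("[\\[\\]]", -1);
        return parts.length == 3 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("prefix[value]suffix"));     // value
        System.out.println(extract("[THE IMPORTANT MESSAGE]")); // THE IMPORTANT MESSAGE
        System.out.println(extract("no brackets"));             // null
    }
}
```

Returning null for anything other than three parts keeps the helper strict about the single-pair assumption.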
4. Input With Multiple Square Bracket Pairs
After solving the single “[..]” pair case, extending the solutions to work with multiple “[..]” cases won’t be a challenge for us. Let’s take a new input example:
final String INPUT2 = "[La La Land], [The last Emperor], and [Life of Pi] are all great movies.";
Next, let’s extract the three movie titles from it:
final List<String> EXPECTED2 = Lists.newArrayList("La La Land", "The last Emperor", "Life of Pi");
4.1. The [(.*)] Idea – Non-Greedy Version
The pattern “\\[(.*)]” extracts the desired content from a single “[..]” pair. But it won’t work for inputs with multiple “[..]” pairs. This is because regex does greedy matching by default. In other words, if we match INPUT2 with “\\[(.*)]”, the capturing group will hold the text between the first ‘[’ and the last ‘]’: “La La Land], [The last Emperor], and [Life of Pi”.
However, we can add a ‘?’ after ‘*’ to make the regex do a non-greedy match. Additionally, as we’ll extract multiple target values, let’s change if (m.find()) to a while loop:
List<String> result = new ArrayList<>();
String rePattern = "\\[(.*?)]";
Pattern p = Pattern.compile(rePattern);
Matcher m = p.matcher(INPUT2);
while (m.find()) {
    result.add(m.group(1));
}
assertThat(result).isEqualTo(EXPECTED2);
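To make the difference between the two quantifiers concrete, the following sketch runs both patterns against the same input and collects every match. The class name GreedyVsReluctant and the extractAll() helper are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name, not part of the original examples.
class GreedyVsReluctant {

    // Collects capturing group 1 of every match of the regex in the input.
    static List<String> extractAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "[La La Land], [The last Emperor], and [Life of Pi] are all great movies.";

        // Greedy: a single match running from the first '[' to the last ']'.
        List<String> greedy = extractAll("\\[(.*)]", input);
        System.out.println(greedy.size()); // 1
        System.out.println(greedy.get(0)); // La La Land], [The last Emperor], and [Life of Pi

        // Non-greedy: each match stops at the nearest ']'.
        System.out.println(extractAll("\\[(.*?)]", input).size()); // 3
    }
}
```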
4.2. Using Character Classes
The NOR character class solution works for inputs with multiple “[..]” pairs too. We only need to change the if statement to a while loop:
List<String> result = new ArrayList<>();
String rePattern = "\\[([^]]*)";
Pattern p = Pattern.compile(rePattern);
Matcher m = p.matcher(INPUT2);
while (m.find()) {
    result.add(m.group(1));
}
assertThat(result).isEqualTo(EXPECTED2);
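As an alternative to the explicit while loop, and not the approach used in the examples above, Java 9 and later offer Matcher.results(), which streams every match. The sketch below assumes Java 9+; the class name StreamExtractor and its extractAll() method are hypothetical:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative class name, not part of the original examples. Requires Java 9+.
class StreamExtractor {

    // Same NOR character class pattern as above, compiled once and reused.
    private static final Pattern BRACKET_CONTENT = Pattern.compile("\\[([^]]*)");

    // Streams all matches and collects capturing group 1 of each.
    static List<String> extractAll(String input) {
        return BRACKET_CONTENT.matcher(input)
            .results()                 // Stream<MatchResult>, one per match
            .map(r -> r.group(1))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "[La La Land], [The last Emperor], and [Life of Pi] are all great movies.";
        System.out.println(extractAll(input)); // [La La Land, The last Emperor, Life of Pi]
    }
}
```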
4.3. Using the split() Method
For inputs with multiple “[..]” pairs, if we split() by the same regex, the result array has more than three elements. So, we can’t simply take the middle (index=1) one:
Input: "---[value1]---[value2]---[value3]---"
Array: "---", "value1", "---", "value2", "---", "value3", "---"
Index: [0] [1] [2] [3] [4] [5] [6]
However, if we look at the indexes, we find that all elements with odd indexes are our target values. Therefore, we can write a loop to get the desired elements from split()’s result:
List<String> result = new ArrayList<>();
String[] strArray = INPUT2.split("[\\[\\]]", -1);
for (int i = 1; i < strArray.length; i += 2) {
    result.add(strArray[i]);
}
assertThat(result).isEqualTo(EXPECTED2);
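The odd-index idea can be sketched as a reusable helper. The class name SplitMultiPair and the extractAll() method are hypothetical, and the sketch relies on the same assumptions as the rest of the tutorial (well-paired, non-nested brackets):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative class name, not part of the original examples.
class SplitMultiPair {

    // Splits on '[' or ']' and collects the elements at odd indexes,
    // which hold the bracketed values when brackets are well-paired.
    static List<String> extractAll(String input) {
        String[] parts = input.split("[\\[\\]]", -1);
        List<String> result = new ArrayList<>();
        for (int i = 1; i < parts.length; i += 2) {
            result.add(parts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractAll("---[value1]---[value2]---[value3]---"));
        // [value1, value2, value3]
    }
}
```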
5. Conclusion
In this article, we learned how to extract text between square brackets in Java. We learned different regex-related approaches to address the challenge, effectively tackling two problem scenarios.
As always, the complete source code for the examples is available over on GitHub.