1. Overview

In this quick tutorial, we’ll focus on the substring functionality of Strings in Java.

We’ll mostly use methods from the String class, plus a few from Apache Commons’ StringUtils class.

In all of the following examples, we’re going to use this simple String:

String text = "Julia Evans was born on 25-09-1984. "
  + "She is currently living in the USA (United States of America).";

2. Basics of substring

Let’s start with a very simple example here – extracting a substring with the start index:

assertEquals("USA (United States of America).", 
  text.substring(67));

Note how we extracted Julia’s country of residence in our example here.

We can also specify an end index, which is exclusive. Without it, substring goes all the way to the end of the String.

Let’s use an end index to get rid of the extra dot at the end of the example above:

assertEquals("USA (United States of America)", 
  text.substring(67, text.length() - 1));

In the examples above, we’ve used the exact position to extract the substring.
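
Hard-coded positions are fragile: substring throws a StringIndexOutOfBoundsException when an index falls outside the String or when the begin index is greater than the end index. As a minimal sketch, the hypothetical helper safeSubstring below (our own illustration, not part of the JDK) clamps both indices into the valid range instead of throwing:

```java
class SubstringBounds {

    // Hypothetical helper: clamps both indices into [0, length] so that
    // substring never throws. Returns "" when the clamped range is empty.
    static String safeSubstring(String s, int begin, int end) {
        int from = Math.max(0, Math.min(begin, s.length()));
        int to = Math.max(from, Math.min(end, s.length()));
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "Julia Evans was born on 25-09-1984. "
          + "She is currently living in the USA (United States of America).";

        // The end index is exclusive, so this covers indices 67, 68 and 69.
        System.out.println(safeSubstring(text, 67, 70));   // USA

        // An end index past the end of the String is clamped to text.length().
        System.out.println(safeSubstring(text, 67, 1000));
    }
}
```

Whether clamping or failing fast is the better choice depends on the caller. Silently clamping can hide bugs, so this is only one possible design.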

2.1. Getting a Substring Starting at a Specific Character

If the position needs to be calculated dynamically based on a character or String, we can use the indexOf method:

assertEquals("United States of America", 
  text.substring(text.indexOf('(') + 1, text.indexOf(')')));

A similar method that can help us locate our substring is lastIndexOf. Let’s use lastIndexOf to extract the year “1984”. It’s the portion of text between the last dash and the first dot:

assertEquals("1984",
  text.substring(text.lastIndexOf('-') + 1, text.indexOf('.')));

Both indexOf and lastIndexOf can take a character or a String as a parameter. Let’s extract the text “USA” and the rest of the text up to and including the closing parenthesis:

assertEquals("USA (United States of America)",
  text.substring(text.indexOf("USA"), text.indexOf(')') + 1));
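
Both indexOf and lastIndexOf return -1 when nothing is found, and passing that value on to substring leads to an exception or a wrong result. The sketch below guards against this and uses the two-argument indexOf(ch, fromIndex) overload so that the closing character is only searched for after the opening one. The class and method names are hypothetical:

```java
import java.util.Optional;

class DelimitedText {

    // Hypothetical helper: returns the text between the first 'open' and the
    // next 'close' after it, or an empty Optional if either one is missing.
    static Optional<String> between(String s, char open, char close) {
        int start = s.indexOf(open);
        if (start < 0) {
            return Optional.empty();
        }
        // Search for the closing character only after the opening one.
        int end = s.indexOf(close, start + 1);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(s.substring(start + 1, end));
    }

    public static void main(String[] args) {
        System.out.println(between("USA (United States of America).", '(', ')'));
        System.out.println(between("no parentheses here", '(', ')'));
    }
}
```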

3. Using subSequence

The String class provides another method called subSequence, which acts similarly to the substring method.

The differences are that it returns a CharSequence instead of a String, and that it always requires both a start and an end index:

assertEquals("USA (United States of America)", 
  text.subSequence(67, text.length() - 1));
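
Since a CharSequence is the common interface of String, StringBuilder and similar types, subSequence is mostly useful when the result is handed to code written against that interface. The following sketch assumes a hypothetical countDigits helper that accepts any CharSequence:

```java
class SubSequenceDemo {

    // Hypothetical helper that works with any CharSequence implementation
    // (String, StringBuilder, CharBuffer, ...).
    static int countDigits(CharSequence chars) {
        int count = 0;
        for (int i = 0; i < chars.length(); i++) {
            if (Character.isDigit(chars.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Julia Evans was born on 25-09-1984. "
          + "She is currently living in the USA (United States of America).";

        // Indices 24 to 33 hold the date "25-09-1984"; the end index is exclusive.
        CharSequence date = text.subSequence(24, 34);
        System.out.println(date + " contains " + countDigits(date) + " digits");
    }
}
```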

4. Using Regular Expressions

Regular expressions will come to our rescue if we have to extract a substring that matches a specific pattern.

In the example String, Julia’s date of birth is in the format “dd-mm-yyyy”. We can match this pattern using the Java regular expression API.

First of all, we need to create a pattern for “dd-mm-yyyy”:

Pattern pattern = Pattern.compile("\\d{2}-\\d{2}-\\d{4}");

Then, we’ll apply the pattern to find a match from the given text:

Matcher matcher = pattern.matcher(text);

Upon a successful match, we can extract the matched String:

if (matcher.find()) {
    assertEquals("25-09-1984", matcher.group());
}
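
If we need the individual parts of the date rather than the whole match, we can wrap each part of the pattern in a capturing group and read it with group(int). This is a minimal sketch; the DateParts class and the firstDate helper are hypothetical names used only for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateParts {

    // One capturing group each for the day, the month and the year.
    private static final Pattern DATE =
      Pattern.compile("(\\d{2})-(\\d{2})-(\\d{4})");

    // Hypothetical helper: returns {day, month, year} for the first
    // dd-mm-yyyy date in the input, or null when there is no match.
    static String[] firstDate(String input) {
        Matcher matcher = DATE.matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = firstDate("Julia Evans was born on 25-09-1984.");
        System.out.println(String.join("/", parts));   // 25/09/1984
    }
}
```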

For more details on Java regular expressions, check out this tutorial.

5. Using split

We can use the split method from the String class to extract a substring. Say we want to extract the first sentence from the example String. This is quite easy to do using split:

String[] sentences = text.split("\\.");

Since the split method accepts a regex, we had to escape the period character. The result is an array of two sentences.

We can use the first sentence (or iterate through the whole array):

assertEquals("Julia Evans was born on 25-09-1984", sentences[0]);
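
Note that every element after the first one still starts with the space that followed the period. One way to clean this up, sketched here with a hypothetical sentences helper, is to trim each piece and drop the empty ones:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SentenceSplit {

    // Hypothetical helper: splits on literal periods, trims each piece
    // and discards pieces that are empty after trimming.
    static List<String> sentences(String text) {
        return Arrays.stream(text.split("\\."))
          .map(String::trim)
          .filter(s -> !s.isEmpty())
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "Julia Evans was born on 25-09-1984. "
          + "She is currently living in the USA (United States of America).";

        sentences(text).forEach(System.out::println);
    }
}
```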

Please note that there are better ways for sentence detection and tokenization using Apache OpenNLP. Check out this tutorial to learn more about the OpenNLP API.

6. Using Scanner

We generally use Scanner to parse primitive types and Strings using regular expressions. A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace.

Let’s find out how to use this to get the first sentence from the example text:

try (Scanner scanner = new Scanner(text)) {
    scanner.useDelimiter("\\.");           
    assertEquals("Julia Evans was born on 25-09-1984", scanner.next());    
}

In the above example, we have set the example String as the source for the scanner to use.

Then we set the period character as the delimiter. It must be escaped; otherwise, it would be treated as the regular expression metacharacter that matches any character.

Finally, we assert the first token from this delimited output.

If required, we can iterate through the complete collection of tokens using a while loop:

while (scanner.hasNext()) {
   // do something with the tokens returned by scanner.next()
}
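
The loop has to run while the Scanner is still open, that is, inside the try-with-resources block. A self-contained sketch that collects all tokens could look like this (the tokens helper is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokens {

    // Hypothetical helper: collects every token separated by a literal period.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        // The scanner is closed automatically when the try block exits,
        // so the loop must live inside it.
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("\\.");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Julia Evans was born on 25-09-1984. "
          + "She is currently living in the USA (United States of America).";

        tokens(text).forEach(token -> System.out.println("[" + token + "]"));
    }
}
```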

7. Maven Dependencies

We can go a bit further and use a useful utility – the StringUtils class – part of the Apache Commons Lang library:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.14.0</version>
</dependency>

You can find the latest version of this library here.

8. Using StringUtils

The Apache Commons libraries add some useful methods for manipulating core Java types. Apache Commons Lang provides a host of helper utilities for the java.lang API, most notably String manipulation methods.

In this example, we’re going to see how to extract a substring nested between two Strings:

assertEquals("United States of America", 
  StringUtils.substringBetween(text, "(", ")"));

There is a simplified version of this method in case the substring is nested in between two instances of the same String:

substringBetween(String str, String tag)

The substringAfter method from the same class gets the substring after the first occurrence of a separator.

The separator isn’t returned:

assertEquals("the USA (United States of America).", 
  StringUtils.substringAfter(text, "living in "));

Similarly, the substringBefore method gets the substring before the first occurrence of a separator.

The separator isn’t returned:

assertEquals("Julia Evans", 
  StringUtils.substringBefore(text, " was born"));
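
When adding the dependency isn’t an option, similar behavior can be approximated with indexOf and substring. The sketch below is a simplified, JDK-only approximation and not the library’s implementation; the class and method names are hypothetical and null handling is left out. It follows what the Commons Lang Javadoc describes for a missing separator: substringBefore returns the input unchanged, while substringAfter returns an empty String.

```java
class PlainSubstrings {

    // Approximates StringUtils.substringBefore for non-null arguments:
    // returns the whole input when the separator is not found.
    static String before(String str, String separator) {
        int pos = str.indexOf(separator);
        return pos < 0 ? str : str.substring(0, pos);
    }

    // Approximates StringUtils.substringAfter for non-null arguments:
    // returns an empty String when the separator is not found.
    static String after(String str, String separator) {
        int pos = str.indexOf(separator);
        return pos < 0 ? "" : str.substring(pos + separator.length());
    }

    public static void main(String[] args) {
        String text = "Julia Evans was born on 25-09-1984. "
          + "She is currently living in the USA (United States of America).";

        System.out.println(before(text, " was born"));   // Julia Evans
        System.out.println(after(text, "living in "));
    }
}
```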

You can check out this tutorial to find out more about String processing using the Apache Commons Lang API.

9. Finding Texts Before and After a Substring

So far, we’ve explored various ways to extract the required substring from an input string. Sometimes, our input contains a specific substring, but instead of extracting that substring, we want to find and extract the text before and after it.

An example can help us understand the problem quickly. Let’s say we want to extract the String values before and after the following substring:

"was born on 25-09-1984. She "

Then, we should get these results:

Before: "Julia Evans "
After: "is currently living in the USA (United States of America)."

For simplicity, we assume the input String always contains the substring only once.

So next, let’s figure out how to extract the required values.

9.1. Using the substring() Method

We’ve seen how to extract a substring with substring(). Next, let’s use this method again to solve this problem:

String substring = "was born on 25-09-1984. She ";
int startIdx = text.indexOf(substring);
String before = text.substring(0, startIdx);
String after = text.substring(startIdx + substring.length());
 
assertEquals("Julia Evans ", before);
assertEquals("is currently living in the USA (United States of America).", after);

As the code shows, the indexOf() and substring() approach solves the problem.

9.2. Using the split() Method

Alternatively, we can look at the substring as a delimiter, and then the split() method can extract the results for us:

String substring = "was born on 25-09-1984. She ";
String[] result = text.split(Pattern.quote(substring));
assertEquals(2, result.length);
 
String before = result[0];
String after = result[1];
 
assertEquals("Julia Evans ", before);
assertEquals("is currently living in the USA (United States of America).", after);

As we can see, split() returns a String array with two elements: the before and after results.

This approach does the job. But some might have noticed that we passed Pattern.quote(substring) instead of substring directly to split(). This is because split() expects a regex pattern as its argument. Pattern.quote(substring) tells the regex engine to treat substring as a literal String. In other words, no character in substring has special meaning in terms of regex.

An example can clearly show the difference. Let’s say we have the input “This is an *important* issue.”, and want to extract the texts before and after the substring “ *important* ”.

Let’s see what we’ll get if we pass the substring directly to split():

String input = "This is an *important* issue.";
String substring = " *important* ";
String[] resultWithoutQuote = input.split(substring);
assertEquals(1, resultWithoutQuote.length);
assertEquals(input, resultWithoutQuote[0]);

As the test shows, if we pass substring directly to split(), the returned array has only one element: the input itself. This is because two parts in substring have special meanings:

  • “ *” – Zero or more space characters
  • “t*” – Zero or more ‘t’ characters

That is to say, the regex no longer looks for the literal asterisks, and our input doesn’t contain any text matching the “ *important* ” pattern. Therefore, split() takes the entire input as the single array element.

But if we pass Pattern.quote(substring) to split(), we get the expected results:

String[] result = input.split(Pattern.quote(substring));
String before = result[0];
String after = result[1];
 
assertEquals("This is an", before);
assertEquals("issue.", after);

So, when we want to treat a regex pattern as a literal String, we should use the Pattern.quote() method.
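
One edge case deserves a short sketch. The single-argument split() drops trailing empty Strings, so if the substring sits at the very end of the input, the array has only one element and result[1] doesn’t exist. Passing a limit of 2 keeps the trailing empty String and also stops splitting after the first occurrence. The beforeAndAfter helper is a hypothetical name:

```java
import java.util.regex.Pattern;

class SplitAround {

    // Hypothetical helper: splits the input around the first occurrence of
    // the literal substring. The array has a single element if the
    // substring is not found, and two elements otherwise.
    static String[] beforeAndAfter(String input, String substring) {
        return input.split(Pattern.quote(substring), 2);
    }

    public static void main(String[] args) {
        String[] result = beforeAndAfter("This is an *important* issue.", " *important* ");
        System.out.println(result[0]);   // This is an
        System.out.println(result[1]);   // issue.
    }
}
```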

10. Extract Texts Between Two Strings

Next, let’s examine another practical problem and see how we can solve it using the techniques we’ve learned.

Let’s say we have an input String:

String input = "a <%One%> b <%Two%> c <%Three%>";

As we can see, some words in the text are surrounded by “<%” and “%>”. Our goal is to extract these words. As there are multiple “<% … %>” pairs, we would like to obtain a List as the result:

List<String> expected = List.of("One", "Two", "Three");

We can employ the regex’s Pattern and Matcher to do the job. First, we must build a regex to match the text between “<%” and “%>”. This isn’t a challenge for us. We can easily come up with “<%(.*)%>”. We put the text between the two boundaries in a capturing group to capture it more easily.

Next, let’s create a test to verify if our regex works as expected:

Pattern pattern = Pattern.compile("<%(.*)%>");
Matcher matcher = pattern.matcher(input);
List<String> result = new ArrayList<>();
while (matcher.find()) {
    result.add(matcher.group(1));
}
assertEquals(1, result.size());
assertEquals("One%> b <%Two%> c <%Three", result.get(0));

As the test shows, we didn’t get the expected three-element List. Instead, we only get one element in the result List. This is because ‘*’ is a greedy quantifier in regex. In other words, “<%(.*)%>” extracts the text between the first “<%” and the last “%>” in the input. This isn’t what we want since the input has three pairs of “<%…%>”. Greedy extraction is a common pitfall when we extract text using regex.

The bug is easy to fix. We can add a question mark after the ‘*’ quantifier to perform non-greedy matching: “<%(.*?)%>”. Now, the regex matches the text between a “<%” and the next “%>”.

Next, let’s test if it solves the problem this time:

Pattern pattern = Pattern.compile("<%(.*?)%>");
Matcher matcher = pattern.matcher(input);
List<String> result = new ArrayList<>();
while (matcher.find()) {
    result.add(matcher.group(1));
}
assertEquals(expected, result);

As the test shows, we got the expected String List.
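
On Java 9 and later, the same extraction can also be written without an explicit loop, because Matcher.results() returns a Stream of MatchResult objects. The following is a sketch under that assumption, with TaggedText and extract as hypothetical names:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TaggedText {

    // Non-greedy group: matches the text between a "<%" and the next "%>".
    private static final Pattern TAG = Pattern.compile("<%(.*?)%>");

    // Hypothetical helper: collects the content of every <% ... %> pair.
    static List<String> extract(String input) {
        return TAG.matcher(input)
          .results()                        // Stream<MatchResult>, Java 9+
          .map(match -> match.group(1))     // content of the capturing group
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(extract("a <%One%> b <%Two%> c <%Three%>"));
    }
}
```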

11. Conclusion

In this quick article, we explored various ways to extract a substring from a String in Java. We also offer other tutorials for String manipulations in Java.

As always, code snippets can be found over on GitHub.