1. Overview

Sometimes, we would like to remove all HTML tags and extract the text from an HTML document string.

The problem looks pretty straightforward. However, depending on the requirements, it can have different variants.

In this tutorial, we’ll discuss how to do that using Java.

2. Using Regex

Since we already have the HTML in a String variable, removing the tags is a text manipulation problem.

When facing text manipulation problems, regular expressions (Regex) are often the first idea that comes to mind.

Matching HTML tags isn’t a challenge for Regex since both start and end tags follow the same pattern: “< … >”.

If we translate it into Regex, it would be “<[^>]*>” or “<.*?>”.

We should note that Regex does greedy matching by default. That is, the Regex “<.*>” won’t work for our problem since we want to match from ‘<’ until the next ‘>’ instead of the last ‘>’ in a line.
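To make the difference concrete, here is a minimal sketch that applies the three patterns to a single line. The class and method names (GreedyVsReluctantDemo, stripGreedy, and so on) are made up for illustration:

```java
public class GreedyVsReluctantDemo {

    // Greedy: ".*" consumes as much as possible, so the match runs
    // from the first '<' to the LAST '>' on the line.
    static String stripGreedy(String input) {
        return input.replaceAll("<.*>", "");
    }

    // Reluctant: ".*?" consumes as little as possible, so the match
    // ends at the next '>' after the opening '<'.
    static String stripReluctant(String input) {
        return input.replaceAll("<.*?>", "");
    }

    // Negated character class: "[^>]*" can never run past a '>'.
    static String stripNegatedClass(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String line = "1. <a href=\"maven.com\">Maven</a> is not installed.<br/>";

        System.out.println("[" + stripGreedy(line) + "]");       // prints "[1. ]"
        System.out.println("[" + stripReluctant(line) + "]");    // prints "[1. Maven is not installed.]"
        System.out.println("[" + stripNegatedClass(line) + "]"); // prints "[1. Maven is not installed.]"
    }
}
```

The two working patterns aren’t fully interchangeable, though. In Java, ‘.’ doesn’t match line terminators unless the DOTALL flag is set, so “<.*?>” skips a tag that spans several lines, while “<[^>]*>” still matches it.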

Now, let’s test if it can remove tags from an HTML source.

2.1. Removing Tags From example1.html

Before we test tag removal, let’s create an HTML example, say example1.html:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
    <title>This is the page title</title>
</head>
<body>
    <p>
        If the application X doesn't start, the possible causes could be:<br/>
        1. <a href="maven.com">Maven</a> is not installed.<br/>
        2. Not enough disk space.<br/>
        3. Not enough memory.
    </p>
</body>
</html>

Now, let’s write a test and use String.replaceAll() to remove HTML tags:

String html = ... // load example1.html
String result = html.replaceAll("<[^>]*>", "");
System.out.println(result);
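If we’d like to run this without loading a file, a self-contained sketch could look like the following. It’s only an illustration: the class name RegexTagStripper and the method removeTags are hypothetical, the HTML is a shortened inline copy of the example, and the pattern is precompiled purely as a design choice so it isn’t recompiled on every call:

```java
import java.util.regex.Pattern;

public class RegexTagStripper {

    // Same pattern as above, compiled once and reused
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replaces every "<...>" sequence with an empty string
    static String removeTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened inline stand-in for example1.html (an assumption for this sketch)
        String html = String.join("\n",
            "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"",
            "        \"http://www.w3.org/TR/html4/loose.dtd\">",
            "<html>",
            "<body>",
            "    <p>",
            "        1. <a href=\"maven.com\">Maven</a> is not installed.<br/>",
            "    </p>",
            "</body>",
            "</html>");

        System.out.println(removeTags(html));
    }
}
```

Note that the DOCTYPE declaration spans two lines. The negated character class “[^>]*” also matches line breaks, so the whole declaration is removed in one match.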

If we run the test method, we see the result:



    This is the page title


    
        If the application X doesn't start, the possible causes could be:
        1. Maven is not installed.
        2. Not enough disk space.
        3. Not enough memory.

The output looks pretty good: all HTML tags have been removed.

The result preserves the whitespace from the original HTML, but we can easily remove or skip those empty lines and leading spaces when we process the extracted text. So far, so good.
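As one possible way to do that cleanup, the sketch below trims every line and drops the empty ones. The class name WhitespaceCleaner and the method dropBlankLines are hypothetical, and joining the remaining lines with ‘\n’ is just one choice among many:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class WhitespaceCleaner {

    // Splits on any line break, trims each line, and keeps only non-empty lines
    static String dropBlankLines(String text) {
        return Arrays.stream(text.split("\\R"))
            .map(String::trim)
            .filter(line -> !line.isEmpty())
            .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // A fragment shaped like the tag-stripped output above
        String stripped = "\n\n    This is the page title\n\n    \n"
            + "        1. Maven is not installed.\n"
            + "        2. Not enough disk space.\n";

        System.out.println(dropBlankLines(stripped));
        // prints:
        // This is the page title
        // 1. Maven is not installed.
        // 2. Not enough disk space.
    }
}
```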

2.2. Removing Tags From example2.html

As we’ve just seen, using Regex to remove HTML tags is pretty straightforward. However, this approach may have problems since we cannot predict what HTML source we’ll get.

For example, an HTML document may have