Extracting Text from HTML Tags Using Regular Expressions

1. Introduction

When working with HTML content in Java, we often need to extract specific text from HTML tags. Although parsing HTML with regular expressions (regex) is generally discouraged because HTML's nested structure is hard to describe with a pattern, regex can be sufficient for simple tasks.

In this tutorial, we’ll see how to extract text from HTML tags using regex in Java.

2. Using Pattern and Matcher Classes

Java provides the Pattern and Matcher classes from java.util.regex, allowing us to define and apply regular expressions to extract text from strings. Below is an example of how to extract text from a specified HTML tag using regex:

@Test
void givenHtmlContentWithBoldTags_whenUsingPatternMatcherClasses_thenExtractText() {
    String htmlContent = "<div>This is a <b>Baeldung</b> article for <b>extracting text</b> from HTML tags.</div>";
    String tagName = "b";
    String patternString = "<" + tagName + ">(.*?)</" + tagName + ">";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(htmlContent);

    List<String> extractedTexts = new ArrayList<>();
    while (matcher.find()) {
        extractedTexts.add(matcher.group(1));
    }

    assertEquals("Baeldung", extractedTexts.get(0));
    assertEquals("extracting text", extractedTexts.get(1));
}

Here, we first define the HTML content, htmlContent, which contains two <b> elements. We also set tagName to “b” so that we extract text from <b> tags.

Then, we compile the regex pattern using the compile() method, where patternString evaluates to “<b>(.*?)</b>”. The lazy group (.*?) captures the text between the opening and closing tags. Afterward, we use a while loop with the find() method to iterate over all matches and add the first capture group of each match to the list named extractedTexts.

Finally, we assert that two texts (“Baeldung” and “extracting text”) are extracted from the <b> tags.
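The same logic can be wrapped in a small reusable method. The following is only a minimal sketch: the class name TagTextExtractor and the method extractTexts() are hypothetical names chosen for illustration, and the call to Pattern.quote() is an added precaution (not part of the example above) so that a tag name containing regex metacharacters is treated literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {

    // Hypothetical helper: returns the text between every <tagName>...</tagName> pair.
    // Like the pattern above, it only matches tags without attributes.
    static List<String> extractTexts(String html, String tagName) {
        // Assumption: quoting the tag name so it is always matched literally.
        String quotedTag = Pattern.quote(tagName);
        Pattern pattern = Pattern.compile("<" + quotedTag + ">(.*?)</" + quotedTag + ">");
        Matcher matcher = pattern.matcher(html);

        List<String> texts = new ArrayList<>();
        while (matcher.find()) {
            // group(1) is the lazily captured content between the tags
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div>This is a <b>Baeldung</b> article for <b>extracting text</b> from HTML tags.</div>";
        System.out.println(extractTexts(html, "b"));
    }
}
```

Returning a list keeps the helper usable for any number of matches, including none, in which case the list is simply empty.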

To handle cases where tag contents may contain newlines, we can modify the pattern string by adding (?s) as follows:

String patternString = "(?s)<" + tagName + ">(.*?)</" + tagName + ">";

Here, we use the regex pattern “(?s)<b>(.*?)</b>” with dotall mode enabled through (?s), so that the dot also matches line terminators and <b> tags that span multiple lines are matched.
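To make the effect of dotall mode concrete, the sketch below runs the same input through the pattern with and without (?s). The class name DotallDemo and the helper extract() are hypothetical. The helper takes the full regex as a parameter so that both variants can be compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {

    // Hypothetical helper: collects capture group 1 of every match of the given regex.
    static List<String> extract(String html, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(html);
        List<String> texts = new ArrayList<>();
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        // The tag content contains a newline character.
        String html = "<b>first\nsecond</b>";

        // Without (?s), the dot does not match the newline, so nothing is found.
        System.out.println(extract(html, "<b>(.*?)</b>").size());     // 0

        // With (?s), the dot matches the newline and the content is captured.
        System.out.println(extract(html, "(?s)<b>(.*?)</b>").size()); // 1
    }
}
```

Passing the Pattern.DOTALL flag as the second argument of Pattern.compile() has the same effect as the embedded (?s) flag.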

3. Using JSoup for HTML Parsing and Extraction

For more complex HTML parsing tasks, especially those involving nested tags, using a dedicated library like JSoup is recommended. Let’s demonstrate how to use JSoup to extract text from <p> tags, including handling nested tags:

@Test
void givenHtmlContentWithNestedParagraphTags_thenExtractAllTextsFromHtmlTag() {
    String htmlContent = "<div>This is a <p>multiline\nparagraph <strong>with nested</strong> content</p> and <p>line breaks</p>.</div>";

    Document doc = Jsoup.parse(htmlContent);
    Elements paragraphElements = doc.select("p");

    List<String> extractedTexts = new ArrayList<>();
    for (Element paragraphElement : paragraphElements) {
        String extractedText = paragraphElement.text();
        extractedTexts.add(extractedText);
    }

    assertEquals(2, extractedTexts.size());
    assertEquals("multiline paragraph with nested content", extractedTexts.get(0));
    assertEquals("line breaks", extractedTexts.get(1));
}

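Here, we use the parse() method to parse the htmlContent string, converting it into a Document object. Next, we call the select() method on the doc object to select all <p> elements within the parsed document.

Subsequently, we iterate over the selected paragraphElements collection, extracting the text content of each <p> element using the paragraphElement.text() method. The text() method returns the combined, whitespace-normalized text of the element and its children, which is why neither the newline nor the nested <strong> tag appears in the extracted result.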
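To see why nested tags are a problem for the regex approach, consider the following sketch. It uses only java.util.regex and does not use JSoup. The class name NestedTagDemo, the helper firstMatch(), and the sample input are all hypothetical and chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTagDemo {

    // Hypothetical helper: returns capture group 1 of the first match, or null if none.
    static String firstMatch(String html, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // A <div> nested inside another <div>.
        String html = "<div>outer <div>inner</div> tail</div>";

        // The lazy pattern stops at the first closing tag, which belongs to the
        // inner element, so the captured text is cut off and still contains markup.
        System.out.println(firstMatch(html, "<div>(.*?)</div>")); // outer <div>inner
    }
}
```

The pattern has no way of knowing which closing tag belongs to which opening tag, whereas a parser builds the element tree first and then reads the text from it.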

4. Conclusion

In this article, we explored two approaches to extracting text from HTML tags in Java. First, we used the Pattern and Matcher classes for regex-based extraction, which works for simple, non-nested tags. Then, we used JSoup for more complex HTML parsing tasks involving nested tags.

As always, the complete source code for the examples is available over on GitHub.
