1. Overview
In this tutorial, we’ll look briefly at the different ways of preserving line breaks when using Jsoup to parse HTML to plain text. We will cover how to preserve line breaks associated with newline (\n) characters, as well as those associated with <br> and <p> tags.
2. Preserving \n While Parsing HTML Text
By default, Jsoup normalizes whitespace in its output, so each newline character (\n) in the HTML text is replaced with a space character.
However, to prevent Jsoup from removing the newline characters, we can change Jsoup’s OutputSettings and disable pretty-print. With pretty-print disabled, the HTML output methods do not reformat the output, so it looks like the input:
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
Furthermore, we can use Jsoup#clean to remove all the HTML tags from the string:
String strHTML = "<html><body>Hello\nworld</body></html>";
String strWithNewLines = Jsoup.clean(strHTML, "", Safelist.none(), outputSettings);
Let’s see what our output string strWithNewLines looks like:
assertEquals("Hello\nworld", strWithNewLines);
Therefore, we can see that by calling Jsoup#clean with Safelist#none and disabling the pretty-print output setting of Jsoup, we are able to preserve the line breaks associated with the newline character.
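To make the contrast concrete, here is a minimal sketch that runs the same input through Jsoup#clean with and without pretty-print. The class and method names (CleanNewlineDemo, cleanDefault, cleanKeepingNewlines) are hypothetical and exist only for illustration. The sketch assumes a Jsoup version that provides the Safelist class:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; these names are ours, not part of Jsoup.
final class CleanNewlineDemo {

    // Default output settings (pretty-print on): whitespace in text is normalized.
    static String cleanDefault(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Pretty-print off: text is written out without whitespace normalization.
    static String cleanKeepingNewlines(String html) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(html, "", Safelist.none(), settings);
    }

    public static void main(String[] args) {
        String html = "<html><body>Hello\nworld</body></html>";
        // Report whether each variant still contains a newline character.
        System.out.println("default keeps newline: " + cleanDefault(html).contains("\n"));
        System.out.println("pretty-print off keeps newline: " + cleanKeepingNewlines(html).contains("\n"));
    }
}
```

Wrapping the two calls in small methods like this makes it easy to compare both behaviors on any input string.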
Let’s see what else we can do!
3. Preserving Line Breaks Associated with <br> and <p> Tags
When cleaning HTML text, the Jsoup#clean method removes the line breaks created by HTML tags like <br> and <p>.
To preserve the line breaks associated with these tags, we first need to create a Jsoup Document from our HTML string:
String strHTML = "<html><body>Hello<br>World<p>Paragraph</p></body></html>";
Document jsoupDoc = Jsoup.parse(strHTML);
Next, we insert a \n marker before each <br> and <p> tag. Once again, we also disable the pretty-print output setting:
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
jsoupDoc.outputSettings(outputSettings);
jsoupDoc.select("br").before("\\n");
jsoupDoc.select("p").before("\\n");
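Here, we used the select method of the Jsoup Document together with the before method to insert the text \n before each tag. Note that the Java literal "\\n" is a two-character marker (a backslash followed by n), not an actual newline character.
After that, we get the HTML string from jsoupDoc and replace each of these markers with a real newline character: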
String str = jsoupDoc.html().replaceAll("\\\\n", "\n");
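The doubled backslashes are easy to misread, so the following plain-Java sketch (no Jsoup involved) spells out what each string literal contains. The class and method names are hypothetical. It also shows String#replace as a regex-free alternative to replaceAll; that is our substitution, not what the code above uses:

```java
// Hypothetical helper class, for illustration only.
final class NewlineMarker {

    // The Java literal "\\n" is two characters: a backslash followed by 'n'.
    static final String MARKER = "\\n";

    // Regex version: the Java literal "\\\\n" is the regex \\n,
    // which matches a literal backslash followed by 'n'.
    static String restoreWithRegex(String html) {
        return html.replaceAll("\\\\n", "\n");
    }

    // Literal version: String#replace does no regex parsing,
    // so the marker can be passed as-is.
    static String restoreLiteral(String html) {
        return html.replace(MARKER, "\n");
    }

    public static void main(String[] args) {
        String marked = "Hello" + MARKER + "<br>World";
        System.out.println(MARKER.length()); // prints 2
        System.out.println(restoreWithRegex(marked).equals(restoreLiteral(marked))); // prints true
    }
}
```

Both methods produce the same result here. The literal version may be easier to read because it needs only one level of escaping.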
Finally, we call Jsoup#clean with Safelist#none and the pretty-print output setting disabled:
String strWithNewLines = Jsoup.clean(str, "", Safelist.none(), outputSettings);
And our output string strWithNewLines looks like:
assertEquals("Hello\nWorld\nParagraph", strWithNewLines);
Thus, by inserting a newline marker before the <br> and <p> tags and disabling the pretty-print output setting of Jsoup, we can preserve the line breaks associated with them.
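The steps in this section can be gathered into a single helper. The following is a minimal sketch under the same assumptions as the snippets above. The class name HtmlToText and the method name toTextWithLineBreaks are hypothetical and not part of Jsoup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical helper class; the names are ours, not part of Jsoup.
final class HtmlToText {

    static String toTextWithLineBreaks(String html) {
        // Disable pretty-print so Jsoup does not reformat the output.
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);

        Document doc = Jsoup.parse(html);
        doc.outputSettings(settings);

        // Insert a literal backslash-n marker before every <br> and <p>.
        doc.select("br").before("\\n");
        doc.select("p").before("\\n");

        // Turn each marker into a real newline character.
        String marked = doc.html().replaceAll("\\\\n", "\n");

        // Strip all tags while keeping the newlines.
        return Jsoup.clean(marked, "", Safelist.none(), settings);
    }

    public static void main(String[] args) {
        String html = "<html><body>Hello<br>World<p>Paragraph</p></body></html>";
        System.out.println(toTextWithLineBreaks(html));
    }
}
```

One caveat of this approach: if the input text already contains a backslash followed by n, that sequence is also converted into a newline, so a less common marker string may be a safer choice for arbitrary input.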
4. Conclusion
In this short article, we learned how to preserve line breaks associated with newline (\n) characters and the <br> and <p> tags when parsing HTML into plain text with Jsoup.
As always, all these code samples are available over on GitHub.