How to Remove Tags From HTML/XML Documents

1. Overview

Both HTML and XML are markup languages, but they serve different purposes. While HTML focuses on structuring and presenting information on web pages, XML is used to store and transport data between different systems. Sometimes, we need to extract or analyze the text within these documents, which involves removing the tags.

In this tutorial, we’ll discuss removing tags from HTML/XML documents. To achieve this, we’ll use sed, awk, Perl, and Python on the command line.

2. Using sed

sed is a command-line stream editor that performs text transformations and pattern matching on an input stream. This makes it a natural fit for removing tags from an HTML or XML document.

Now, let’s use sed to remove the tags:

$ sed -e ':a;N;$!ba;s/<[^>]*>//g' index.html 
    My Blog Website
        Hello New User, welcome to My Blog website where you can find anything

Let’s understand the above command:

-e – specifies that the following command is a script
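:a;N;$!ba; – creates a loop that appends each following line to the pattern space until the last line is reached, so the whole file is processed as a single buffer. This one-line form works with GNU sed
s/ – indicates this is a substitution expression
<[^>]*> – a regular expression that matches a <, followed by any number of characters other than >, followed by a >, which covers typical HTML or XML tags
// – used to replace the matched tags with nothing
g – ensures that all occurrences of the pattern in the pattern space (here, the whole file) are replaced and not just the first occurrence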
index.html – represents the input file sed will process

Using the above command, we remove all the tags in the index.html document and print the remaining text.
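To see what the <[^>]*> pattern does in isolation, here is a minimal Python sketch that applies the same regular expression to a whole string at once, much like the sed loop does with the whole file. The strip_tags function and the sample string are hypothetical and only serve as an illustration; they are not part of the sed workflow above.

```python
import re

# Same pattern as in the sed script: "<", any run of characters
# other than ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Delete everything that looks like a tag from the whole input."""
    return TAG_PATTERN.sub("", markup)


# Hypothetical sample input, not the index.html file used above.
sample = '<p class="intro">Hello <b>New\nUser</b></p>'
print(strip_tags(sample))
```

Since the pattern stops at the first > it finds, it is only an approximation of real tag syntax. For example, a comment such as <!-- a > b --> is cut at the first >, and the rest of the comment stays in the output.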

To remove tags from an XML document, we use the same command and just replace the input file with an XML file:

$ sed -e ':a;N;$!ba;s/<[^>]*>//g' names.xml    
      Gambardella, Matthew
      XML Developer's Guide
      Computer

Additionally, we can redirect the output to another file:

$ sed -e ':a;N;$!ba;s/<[^>]*>//g' names.xml > removed_xml_tags.txt

Above, after removing the tags in the names.xml file, we redirect the output of sed to a new file named removed_xml_tags.txt.

3. Using awk

awk is a command-line tool for searching and manipulating data in text files. We can also use it to remove tags from HTML or XML documents.

To demonstrate, we’ll begin by removing tags in an XML document using awk:

$ awk 'BEGIN {RS="<[^>]+>"} {gsub(/[\t\n ]+/, " "); print}' names.xml > removed_xml_tags.txt

Let’s understand this command:

BEGIN {RS="<[^>]+>"} – a block that runs before awk reads any input. Here, it sets the record separator (RS) to a regular expression that matches any HTML or XML tag, so each tag acts as a separator and the text between two tags becomes a record. A regular expression as RS is an extension supported by implementations such as gawk and mawk; POSIX only defines the behavior for a single-character RS
gsub(/[\t\n ]+/, " ") – a global substitution that replaces every run of tabs, newlines, or spaces in the current record with a single space
print – prints out the processed text
names.xml – represents the input file awk will process
> removed_xml_tags.txt – redirects the output of awk to a file named removed_xml_tags.txt

The above command removes all the tags in the names.xml file and redirects the output to a file named removed_xml_tags.txt.
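The record-splitting idea behind this awk command can be sketched in Python as well. The snippet below is only a rough analogue under our own assumptions, not a reimplementation of awk: it splits the input wherever a tag occurs and collapses whitespace inside each resulting piece. The split_on_tags function and the sample string are hypothetical.

```python
import re


def split_on_tags(markup):
    """Split the input at every tag and collapse whitespace in each piece."""
    # Each tag acts as a separator, like RS="<[^>]+>" in the awk command.
    records = re.split(r"<[^>]+>", markup)
    # Like gsub(/[\t\n ]+/, " "): squeeze runs of tabs, newlines, and spaces.
    return [re.sub(r"[\t\n ]+", " ", record) for record in records]


# Hypothetical sample input, not the names.xml file used above.
sample = "<name>\n  Gambardella, Matthew\n</name>"
for record in split_on_tags(sample):
    print(record)
```

The empty strings before the first tag and after the last one show up as blank lines, which is why output produced this way tends to contain more empty lines than the original text.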

Likewise, to remove tags from an HTML document, we use the same command:

$ awk 'BEGIN {RS="<[^>]+>"} {gsub(/[\t\n ]+/, " "); print}' index.html > removed_html_tags.txt

Here, we remove tags from index.html and then redirect the output to a file named removed_html_tags.txt.

4. Using Perl

Perl is a programming language well suited to text processing, including complex string manipulation with regular expressions.

Now, let’s use Perl to remove tags:

$ perl -pe 's/<[^>]*>//g' names.xml 
      Gambardella, Matthew
      XML Developer's Guide
      Computer

Let’s understand the command:

-p – wraps the script in a loop that reads the input line by line, runs the script on each line, and prints the result
-e – allows us to specify a Perl script on the command line
s/ – represents a substitution operator to search for a specific pattern and replace it
<[^>]*> – represents a regular expression used to match HTML and XML tags
//g – deletes all occurrences of the pattern on each line
names.xml – represents the input file to be processed

Here, we remove the tags in the names.xml file and print the text content. Because -p processes one line at a time, a tag that spans several lines isn’t matched by this command.
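The difference between processing one line at a time and processing the whole input at once is easy to check with a small Python sketch. Both functions and the sample string below are hypothetical; they only imitate the two processing modes with the same regular expression.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_per_line(markup):
    """Apply the pattern to each line separately, as a line-by-line loop does."""
    lines = markup.splitlines(keepends=True)
    return "".join(TAG_PATTERN.sub("", line) for line in lines)


def strip_whole_input(markup):
    """Apply the pattern to the entire input as one string."""
    return TAG_PATTERN.sub("", markup)


# Hypothetical sample with an opening tag that spans two lines.
sample = '<book\n  id="bk101">XML Developer\'s Guide</book>\n'
print(repr(strip_per_line(sample)))
print(repr(strip_whole_input(sample)))
```

In the per-line variant, the opening tag survives because neither line contains both its < and its >. Only the closing tag, which sits on a single line, is removed.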

Next, let’s remove tags from an HTML document. We’ll make use of the HTML::Strip Perl module.

First, we need to install it. On Ubuntu/Debian distributions we use apt:

$ sudo apt install libhtml-strip-perl

On Arch Linux, we use pacman:

$ sudo pacman -S perl-html-strip

Finally, on Fedora, we use dnf:

$ sudo dnf install perl-HTML-Strip

Now, let’s remove the tags:

$ perl -MHTML::Strip -0777 -pe '$_ = HTML::Strip->new()->parse($_)' index.html > removed_html_tags.txt

Let’s break down the above command:

-MHTML::Strip – instructs Perl to import the HTML::Strip module
-0777 – sets the input record separator to undef, allowing the entire file to be read at once
-pe – combines -p and -e, so Perl runs the script on each record and prints the result. Because of -0777, the only record is the entire file
$_ = HTML::Strip->new()->parse($_) – represents a Perl script that initializes a new HTML::Strip object and uses it to remove HTML tags from the content stored in the default variable $_
index.html – represents the input file
> removed_html_tags.txt – redirects the output to a file named removed_html_tags.txt

Using the above command, we successfully remove all the tags in the index.html file. We then redirect the output to a new file named removed_html_tags.txt.

5. Using Python

Python is a general-purpose programming language with good support for parsing and processing text. To illustrate, we’ll use it to remove tags from HTML and XML documents.

First, we’ll remove tags from an HTML document using the Beautiful Soup library, which is distributed as the beautifulsoup4 package.

Let’s start by installing it using pip:

$ pip3 install beautifulsoup4

Once installed, let’s go ahead and remove the tags:

$ python3 -c "from bs4 import BeautifulSoup; print(BeautifulSoup(open('index.html', 'r').read(), 'html.parser').get_text())"

Let’s break down this command:

-c – allows us to specify a Python command on the command line
from bs4 import BeautifulSoup – imports the BeautifulSoup class from the bs4 package, a library for parsing HTML and XML that is often used for web scraping
open('index.html', 'r').read() – opens the index.html file in read mode and reads its content
BeautifulSoup(…, 'html.parser').get_text() – here, we use BeautifulSoup to parse the HTML content and extract the text without the HTML tags
print(…) – used to print the extracted text to the terminal

Using the above command, we remove tags in the index.html file and print the output to the terminal.
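If installing a third-party package isn’t an option, a similar result can be approximated with the html.parser module from Python’s standard library. The following is a minimal sketch under our own assumptions, not a replacement for Beautiful Soup: the TextExtractor class and html_to_text function are hypothetical names, and the choice to skip the contents of script and style elements is ours.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample input, not the index.html file used above.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello &amp; welcome</p></body></html>"
)
print(html_to_text(sample))
```

Unlike the regular expression approach, a parser also decodes character references, so &amp; appears as & in the output.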

Next, let’s remove tags from an XML document:

$ python3 -c "import xml.etree.ElementTree as ET; print(''.join(ET.fromstring(open('names.xml', 'r').read()).itertext()))"

Let’s understand the above command:

import xml.etree.ElementTree as ET – imports the ElementTree module from the standard library’s xml.etree package, which we use to parse and manipulate XML documents
open('names.xml', 'r').read() – opens the names.xml file in read mode and reads its content
ET.fromstring(…) – parses the XML content and returns an Element object representing the root element
itertext() – iterates over the text content of the root element and all of its descendants, in document order
''.join(…) – joins the text content into a single string
print(…) – prints the results

Here, we remove tags from the names.xml file and print the output to the terminal.

Additionally, we can save the output in another file:

$ python3 -c "import xml.etree.ElementTree as ET; print(''.join(ET.fromstring(open('names.xml', 'r').read()).itertext()))" > removed_xml_tags.txt

Here, we use > to redirect the output to a file named removed_xml_tags.txt instead of printing it on the terminal.
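For anything longer than a one-liner, the same ElementTree calls read more easily as a short script. The sketch below additionally drops whitespace-only chunks, such as the indentation between elements, which is our own choice and not something the command above does. The xml_text_lines function is a hypothetical name, and the sample document is made up for illustration, since it is only modeled on the output shown earlier.

```python
import xml.etree.ElementTree as ET


def xml_text_lines(xml_source):
    """Return the non-empty text chunks of an XML document, one per entry."""
    root = ET.fromstring(xml_source)
    lines = []
    for chunk in root.itertext():
        # Collapse internal whitespace and drop indentation-only chunks.
        text = " ".join(chunk.split())
        if text:
            lines.append(text)
    return lines


# Hypothetical sample document, not the actual names.xml file.
sample = """<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
  </book>
</catalog>"""
print("\n".join(xml_text_lines(sample)))
```

To process a file instead of a string, ET.parse('names.xml').getroot() returns the same kind of root element.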

6. Conclusion

In this article, we discussed different methods for removing tags from HTML and XML documents in Linux. To summarize, sed and awk are suitable for simple, regex-based tag removal, while Perl and Python, with modules such as HTML::Strip, Beautiful Soup, and ElementTree, are better suited to more complex documents. We can use any of these methods according to our needs and preference.
