1. Overview
Both HTML and XML are markup languages, but they serve different purposes. HTML structures and presents information on web pages, while XML stores and transports data between systems. Sometimes, we need to extract or analyze the text in these documents, which means removing the tags first.
In this tutorial, we’ll discuss removing tags from HTML/XML documents. To achieve this, we’ll use sed, awk, Perl, and Python on the command line.
2. Using sed
sed is a command-line stream editor that performs text processing and pattern matching on an input stream. This makes it a natural fit for removing tags from an HTML or XML document.
Now, let’s use sed to remove the tags:
$ sed -e ':a;N;$!ba;s/<[^>]*>//g' index.html
My Blog Website
Hello New User, welcome to My Blog website where you can find anything
Let’s understand the above command:
- -e – specifies that the following command is a script
- :a;N;$!ba; – creates a loop that appends each line of the file to the pattern space until the last line is reached, so that tags spanning several lines can still be matched
- s/ – indicates this is a substitution expression
- <[^>]*> – represents a regular expression that matches any HTML or XML tags
- // – used to replace the matched tags with nothing
- g – ensures that all occurrences of the pattern are replaced and not just the first one; because the loop has joined the lines, this covers the whole file
- index.html – represents the input file sed will process
Using the above command, we remove all the tags in the index.html document and print out the file’s content.
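To see what the regular expression in the sed command does and where it falls short, here’s a minimal Python sketch that applies the same pattern with the re module. The strip_tags function and the sample strings are hypothetical and only serve as an illustration; they aren’t the contents of index.html:

```python
import re

# Same pattern as in the sed command: "<", any run of non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(markup: str) -> str:
    """Delete every substring that matches TAG_PATTERN."""
    return TAG_PATTERN.sub("", markup)


# Hypothetical sample input (not the tutorial's index.html).
sample = '<h1>My Blog</h1>\n<p class="intro">Hello <b>New User</b></p>'
print(strip_tags(sample))

# Limitation: a ">" inside an attribute value ends the match too early,
# so part of the tag is left behind in the output.
tricky = '<a title="a > b">link</a>'
print(repr(strip_tags(tricky)))  # ' b">link'
```

The second example shows why a pattern-based approach suits simple, well-behaved markup, while documents with unusual attribute values, comments, or embedded scripts are better handled by a real parser.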
To remove tags from an XML document, we use the same command and only change the input file to an XML file:
$ sed -e ':a;N;$!ba;s/<[^>]*>//g' names.xml
Gambardella, Matthew
XML Developer's Guide
Computer
Additionally, we can redirect the output to another file:
$ sed -e ':a;N;$!ba;s/<[^>]*>//g' names.xml > removed_xml_tags.txt
Above, after removing the tags in the names.xml file, we redirect the output of sed to a new file named removed_xml_tags.txt.
3. Using awk
awk is a command-line tool that allows us to search and manipulate data in text files. We can also use it to remove tags from HTML or XML documents.
To demonstrate, we’ll begin by removing tags in an XML document using awk:
$ awk 'BEGIN {RS="<[^>]+>"} {gsub(/[\t\n ]+/, " "); print}' names.xml > removed_xml_tags.txt
Let’s understand this command:
- BEGIN {RS="<[^>]+>"} – represents a block that is executed before awk reads any input. Here, we use RS="<[^>]+>" to set the record separator (RS) to a regular expression that matches any HTML or XML tag. As a result, awk treats the text between tags as separate records and discards the tags themselves. Note that a regular expression as RS is an extension supported by implementations such as gawk and isn’t guaranteed by POSIX awk.
- gsub(/[\t\n ]+/, " ") – represents a global substitution function that replaces every run of one or more tabs, newlines, or spaces in the record with a single space
- print – prints out the processed text
- names.xml – represents the input file awk will process
- > removed_xml_tags.txt – redirects the output of awk to a file named removed_xml_tags.txt
The above command removes all the tags in the names.xml file and redirects the output to a file named removed_xml_tags.txt.
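The record-splitting idea behind the awk command can be sketched in Python as follows. This is only an approximation of awk’s behavior under the assumption that splitting on the tag pattern mirrors RS; the awk_like_strip function and the sample string are hypothetical:

```python
import re
from typing import List


def awk_like_strip(markup: str) -> List[str]:
    """Split the input on tags (like RS), then squeeze whitespace in
    each resulting record (like the gsub call)."""
    records = re.split(r"<[^>]+>", markup)
    return [re.sub(r"[\t\n ]+", " ", record) for record in records]


# Hypothetical sample input (not the tutorial's names.xml).
sample = "<book>\n  <author>Gambardella, Matthew</author>\n</book>\n"
for record in awk_like_strip(sample):
    print(repr(record))
```

Since the whitespace between adjacent tags also forms records, the result contains empty or space-only entries. This matches the blank lines we typically see in the awk output.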
Likewise, to remove tags from an HTML document, we use the same command:
$ awk 'BEGIN {RS="<[^>]+>"} {gsub(/[\t\n ]+/, " "); print}' index.html > removed_html_tags.txt
Here, we remove tags from index.html and then redirect the output to a file named removed_html_tags.txt.
4. Using Perl
Perl is a programming language we can use to manipulate and process text. We can use it for complex string manipulation using regular expressions.
Now, let’s use Perl to remove tags:
$ perl -pe 's/<[^>]*>//g' names.xml
Gambardella, Matthew
XML Developer's Guide
Computer
Let’s understand the command:
- -p – instructs Perl to loop through each line in the input file and print it
- -e – allows us to specify a Perl script on the command line
- s/ – represents a substitution operator to search for a specific pattern and replace it
- <[^>]*> – represents a regular expression used to match HTML and XML tags
- //g – deletes all occurrences of the pattern on each line
- names.xml – represents the input file to be processed
Here, we remove all the tags in the names.xml file and print the text content. Since -p processes one line at a time, this works as long as no tag is split across lines.
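The difference between processing the input line by line, as perl -pe does, and processing it as a whole, as Perl’s -0777 switch allows, can be illustrated with a short Python sketch. The function names and the sample string are hypothetical:

```python
import re

TAG = re.compile(r"<[^>]*>")


def strip_per_line(markup: str) -> str:
    """Apply the substitution to each line separately, like perl -pe."""
    lines = markup.splitlines(keepends=True)
    return "".join(TAG.sub("", line) for line in lines)


def strip_whole(markup: str) -> str:
    """Apply the substitution to the whole input at once."""
    return TAG.sub("", markup)


# Hypothetical input: the opening <item> tag is split across two lines.
sample = '<item\n  id="1">Computer</item>\n'
print(repr(strip_per_line(sample)))  # '<item\n  id="1">Computer\n'
print(repr(strip_whole(sample)))     # 'Computer\n'
```

In the per-line version, neither line contains a complete opening tag, so only the closing tag is removed.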
Next, let’s remove tags from an HTML document. We’ll make use of the HTML::Strip Perl module.
First, we need to install it. On Ubuntu/Debian distributions we use apt:
$ sudo apt install libhtml-strip-perl
On Arch Linux, we use pacman:
$ sudo pacman -S perl-html-strip
Finally, on Fedora, we use dnf:
$ sudo dnf install perl-HTML-Strip
Now, let’s remove the tags:
$ perl -MHTML::Strip -0777 -pe '$_ = HTML::Strip->new()->parse($_)' index.html > removed_html_tags.txt
Let’s break down the above command:
- -MHTML::Strip – instructs Perl to import the HTML::Strip module
- -0777 – sets the input record separator to undef, allowing the entire file to be read at once
- -pe – combines -p and -e to run the script on each input record and print the result; because of -0777, the only record is the entire file
- $_ = HTML::Strip->new()->parse($_) – represents a Perl script that initializes a new HTML::Strip object and uses it to remove HTML tags from the content stored in the default variable $_
- index.html – represents the input file
- > removed_html_tags.txt – redirects the output to a file named removed_html_tags.txt
Using the above command, we successfully remove all the tags in the index.html file. We then redirect the output to a new file named removed_html_tags.txt.
5. Using Python
Python is a general-purpose programming language that’s well suited to parsing and processing text. To illustrate, we’ll use it to remove tags from HTML and XML documents.
First, we’ll remove tags from an HTML document using the Beautiful Soup library, which is distributed as the beautifulsoup4 package.
We start by installing Beautiful Soup using pip:
$ pip3 install beautifulsoup4
Once installed, let’s go ahead and remove the tags:
$ python3 -c "from bs4 import BeautifulSoup; print(BeautifulSoup(open('index.html', 'r').read(), 'html.parser').get_text())"
Let’s break down this command:
- -c – allows us to specify a Python command on the command line
- from bs4 import BeautifulSoup – imports BeautifulSoup from the bs4 library, which is commonly used for web scraping
- open('index.html', 'r').read() – opens the index.html file in read mode and reads its content
- BeautifulSoup(…, 'html.parser').get_text() – here, we use BeautifulSoup to parse the HTML content with Python’s built-in html.parser and extract the text without the HTML tags
- print(…) – used to print the extracted text to the terminal
Using the above command, we remove tags in the index.html file and print the output to the terminal.
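If installing a third-party library isn’t an option, a similar result can be approximated with the html.parser module from Python’s standard library. The following is a minimal sketch rather than a drop-in replacement for Beautiful Soup: the TextExtractor class and html_to_text function are hypothetical names, and skipping the contents of script and style elements is a design choice of this sketch:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside <script> and <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks)


def html_to_text(markup: str) -> str:
    """Parse the markup and return only its text content."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.text()


# Hypothetical sample input (not the tutorial's index.html).
print(html_to_text("<h1>My Blog</h1>\n<p>Hello &amp; welcome</p>"))
```

Unlike the regular-expression approach, the parser also decodes character references such as &amp;amp; into the characters they represent.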
Next, let’s remove tags from an XML document:
$ python3 -c "import xml.etree.ElementTree as ET; print(''.join(ET.fromstring(open('names.xml', 'r').read()).itertext()))"
Let’s understand the above command:
- import xml.etree.ElementTree as ET – used to import the ElementTree module from the xml library which we use to parse and manipulate XML documents
- open('names.xml', 'r').read() – opens the names.xml file in read mode and reads its content
- ET.fromstring(…) – used to parse the XML content and return the root Element of the document
- itertext() – iterates over the text content of the root element and all of its descendants, in document order
- ''.join(…) – joins the text content into a single string
- print(…) – prints the results
Here, we remove tags from the names.xml file and print the output to the terminal.
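For anything longer than a one-liner, the same ElementTree calls read more easily as a short script. The sketch below assumes a small inline document modelled on the output shown earlier; SAMPLE_XML and xml_text_lines are hypothetical names, and dropping whitespace-only text is an addition that the one-liner doesn’t perform:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the output shown for names.xml.
SAMPLE_XML = """<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
  </book>
</catalog>"""


def xml_text_lines(xml_source: str) -> list:
    """Return the non-blank text nodes of the document, in document order."""
    root = ET.fromstring(xml_source)  # returns the root Element
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


print("\n".join(xml_text_lines(SAMPLE_XML)))
```

Filtering out whitespace-only chunks removes the indentation between elements, which the plain ''.join(…) version keeps in its output.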
Additionally, we can save the output in another file:
$ python3 -c "import xml.etree.ElementTree as ET; print(''.join(ET.fromstring(open('names.xml', 'r').read()).itertext()))" > removed_xml_tags.txt
Here, we use > to redirect the output to a file named removed_xml_tags.txt instead of printing it on the terminal.
6. Conclusion
In this article, we discussed different methods for removing tags from HTML and XML documents in Linux. To summarize, sed and awk are suitable for simple, pattern-based tag removal, while Perl and Python, with modules such as HTML::Strip, Beautiful Soup, and ElementTree, are better suited to more complex documents. We can choose among these methods according to the input and our preference.