1. Overview
In this tutorial, we’ll learn about the xmllint command-line tool. Particularly, we’ll learn the wide range of functionalities offered by xmllint in the context of handling XML files through sample use cases.
2. xmllint
XML is a markup language that’s being used widely to structure and transfer data across the wire. While there are tonnes of libraries and frameworks that allow us to parse and handle XML documents, xmllint is one of the most versatile XML command-line tools in Linux.
2.1. Installation
To install xmllint on Debian-based Linux, we can install the libxml2-utils package with apt-get:
$ sudo apt-get update -qq
$ sudo apt-get install -y libxml2-utils
On the other hand, on RHEL-based Linux (such as CentOS), xmllint ships as part of the libxml2 package, which we can install using yum:
$ sudo yum install -y libxml2
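Whichever package manager we use, a quick check confirms that the binary ended up on the PATH. The snippet below is a minimal sketch; the variable name xmllint_status is only illustrative:

```shell
#!/bin/sh
# Check whether xmllint is available on the PATH before relying on it in scripts.
if command -v xmllint >/dev/null 2>&1; then
    xmllint_status="installed"
    # The version banner is typically written to standard error, hence 2>&1.
    xmllint --version 2>&1 | head -n 1
else
    xmllint_status="missing"
    echo "xmllint not found; install it with the package manager first" >&2
fi
echo "xmllint is $xmllint_status"
```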
2.2. General Syntax
Generally, we run the xmllint command with a list of optional flags and one or more XML file paths at the end:
xmllint [options] xml_file_1 xml_file_2 ...
Let’s look at some use cases for xmllint.
3. Parsing and Formatting XML
3.1. Parsing and Quick Validation
When we run xmllint on an XML file without any options, xmllint simply parses the file and prints its content to standard output. If the parsing succeeds and the content is printed without any error, we can be confident that the XML file is well-formed. Therefore, we can use xmllint as a quick way to check whether an XML document is corrupted.
For example, when we run xmllint on an existing laptop.xml XML file, we’ll see the parsed content:
$ xmllint laptop.xml
<?xml version="1.0"?>
<specification>
<type>laptop</type>
<model>macbook</model>
<screenSizeInch>15</screenSizeInch>
</specification>
We can additionally add the --noout option to stop xmllint from printing the content of the XML file to standard output. In that case, the exit status alone tells us whether the parsing succeeded:
$ xmllint --noout laptop.xml
$ echo $?
0
On the other hand, xmllint will return an error if an XML file is malformed. To demonstrate the scenario, let’s copy laptop.xml into a separate file, laptop-malformed.xml. Then, we remove the ending tag of the screenSizeInch node:
$ cat laptop-malformed.xml
<specification>
<type>laptop</type>
<model>macbook</model>
<screenSizeInch>15
</specification>
Now, when we run the xmllint command on laptop-malformed.xml, it will result in an error:
$ xmllint --noout laptop-malformed.xml
laptop-malformed.xml:5: parser error : Opening and ending tag mismatch: screenSizeInch line 4 and specification
</specification>
^
laptop-malformed.xml:6: parser error : EndTag: '</' not found
^
$ echo $?
1
As expected, xmllint complains about the missing end tag that we’ve removed in the XML file.
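Because the exit status reflects well-formedness, the same check can be applied to a batch of files. The following is a minimal sketch that creates two throwaway files in a temporary directory (good.xml and bad.xml are hypothetical names) and counts the results:

```shell
#!/bin/sh
# Report which XML files in a scratch directory are well-formed.
workdir=$(mktemp -d)

# Two throwaway sample files: one well-formed, one with a missing end tag.
printf '<specification><type>laptop</type></specification>\n' > "$workdir/good.xml"
printf '<specification><type>laptop</specification>\n' > "$workdir/bad.xml"

well_formed=0
malformed=0
for file in "$workdir"/*.xml; do
    # --noout suppresses the document; only the exit status matters here.
    if xmllint --noout "$file" 2>/dev/null; then
        echo "OK        $file"
        well_formed=$((well_formed + 1))
    else
        echo "MALFORMED $file"
        malformed=$((malformed + 1))
    fi
done

echo "$well_formed well-formed, $malformed malformed"
rm -rf "$workdir"
```

Discarding standard error keeps the report compact; dropping the 2>/dev/null redirection would show the parser errors for each malformed file.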
3.2. Prettifying XML
To format and prettify an XML document, we can run xmllint with the --format option. For instance, let’s say our laptop-unformatted.xml is unformatted and not indented:
$ cat laptop-unformatted.xml
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE specification SYSTEM "laptop.dtd"><specification>
<type>laptop</type><model>macbook</model><screenSizeInch>15</screenSizeInch></specification>
We could reformat the document so that it’s much more readable using xmllint:
$ xmllint --format laptop-unformatted.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE specification SYSTEM "laptop.dtd">
<specification>
  <type>laptop</type>
  <model>macbook</model>
  <screenSizeInch>15</screenSizeInch>
</specification>
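Since --format writes to standard output, rewriting a file in place needs an intermediate file. The sketch below uses the --output option together with a temporary file, so the original is replaced only when parsing succeeds; the paths are throwaway examples:

```shell
#!/bin/sh
# Reformat an XML file in place, keeping the original if parsing fails.
workdir=$(mktemp -d)
target="$workdir/laptop-unformatted.xml"
printf '<specification><type>laptop</type><model>macbook</model></specification>\n' > "$target"

# Write the formatted result to a temporary file first, then replace the
# original only when xmllint exits successfully.
if xmllint --format --output "$target.tmp" "$target"; then
    mv "$target.tmp" "$target"
else
    rm -f "$target.tmp"
    echo "left $target untouched" >&2
fi

line_count=$(wc -l < "$target" | tr -d ' ')
cat "$target"
rm -rf "$workdir"
```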
Additionally, we can change the indentation string by setting the environment variable XMLLINT_INDENT. For instance, instead of the default two spaces, we can reformat our XML document with an eight-space indent.
To do so, we first set the XMLLINT_INDENT environment variable to a string of eight spaces:
$ export XMLLINT_INDENT="        "
Then, the command xmllint --format will format the XML document with an indentation of eight spaces:
$ xmllint --format laptop-unformatted.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE specification SYSTEM "laptop.dtd">
<specification>
        <type>laptop</type>
        <model>macbook</model>
        <screenSizeInch>15</screenSizeInch>
</specification>
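If we don’t want the setting to persist in the session, the variable can be scoped to a single invocation. This is a small sketch using a four-space indent on a throwaway file:

```shell
#!/bin/sh
# Use a four-space indent for one invocation only, without exporting the variable.
workdir=$(mktemp -d)
printf '<specification><type>laptop</type></specification>\n' > "$workdir/laptop.xml"

# Prefixing the assignment scopes XMLLINT_INDENT to this single command.
formatted=$(XMLLINT_INDENT="    " xmllint --format "$workdir/laptop.xml") \
    || echo "xmllint failed" >&2
printf '%s\n' "$formatted"

rm -rf "$workdir"
```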
3.3. Removing Ignorable Empty Spaces
To keep the XML document small, we can remove the indentation spaces and newlines using xmllint with the --noblanks option:
$ cat laptop.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE specification SYSTEM "laptop.dtd">
<specification>
<type>laptop</type>
<model>macbook</model>
<screenSizeInch>15</screenSizeInch>
</specification>
$ xmllint --noblanks laptop.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE specification SYSTEM "laptop.dtd">
<specification><type>laptop</type><model>macbook</model><screenSizeInch>15</screenSizeInch></specification>
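To see how much space this saves, we can write the compacted document to a file and compare byte counts. The sketch below assumes a throwaway copy of the document and a hypothetical output name, laptop.min.xml:

```shell
#!/bin/sh
# Compare the size of a document before and after removing ignorable blanks.
workdir=$(mktemp -d)
cat > "$workdir/laptop.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<specification>
  <type>laptop</type>
  <model>macbook</model>
  <screenSizeInch>15</screenSizeInch>
</specification>
EOF

# --output saves the result to a file instead of printing it.
if xmllint --noblanks --output "$workdir/laptop.min.xml" "$workdir/laptop.xml"; then
    original_bytes=$(wc -c < "$workdir/laptop.xml" | tr -d ' ')
    minified_bytes=$(wc -c < "$workdir/laptop.min.xml" | tr -d ' ')
    echo "original: $original_bytes bytes, minified: $minified_bytes bytes"
fi
rm -rf "$workdir"
```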
3.4. Removing DTD from Output
To remove the DTD from an XML document, we can run xmllint with the --dropdtd option:
$ cat laptop-w-dtd.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE specification [
<!ELEMENT specification (type,model,screenSizeInch,hasBluetooth)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT model (#PCDATA)>
<!ELEMENT screenSizeInch (#PCDATA)>
<!ELEMENT hasBluetooth (#PCDATA)>
]>
<specification>
<type>laptop</type>
<model>macbook</model>
<screenSizeInch>15</screenSizeInch>
</specification>
$ xmllint --dropdtd laptop-w-dtd.xml
<?xml version="1.0" encoding="UTF-8"?>
<specification>
<type>laptop</type>
<model>macbook</model>
<screenSizeInch>15</screenSizeInch>
</specification>
This option could be helpful when we want to save just the XML document into a separate file without the DTD.
4. Validating XML
In XML, two common kinds of validation are Document Type Definition (DTD) and XML Schema Definition (XSD). Given the prevalence of both, xmllint supports them through the options --valid, --dtdvalid, and --schema.
4.1. Validating Against Document Type Definition (DTD)
A Document Type Definition (DTD) is a document that defines what constitutes a valid XML document. It complements its target XML document by ensuring that the document complies with a predefined structure. For example, here’s one possible DTD:
$ cat laptop.dtd
<!ELEMENT specification (type,model,screenSizeInch)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT model (#PCDATA)>
<!ELEMENT screenSizeInch (#PCDATA)>
In the DTD, we can see the definitions for all the nodes: specification, type, model, and screenSizeInch. Firstly, the definition of the specification node says that it must contain the type, model, and screenSizeInch child nodes, in that order. Subsequently, each of the child nodes is declared to contain parsed character data (#PCDATA).
Using xmllint, we can validate our XML document against a DTD to verify the validity of the XML document.
For a DTD embedded in the XML document itself, validation is as simple as passing the --valid option to xmllint. For example, we can validate laptop-w-dtd.xml from the previous section with the --valid option alone, since its DTD is declared within the document. Because that DTD requires a hasBluetooth node that the document lacks, the validation fails:
$ xmllint --noout --valid laptop-w-dtd.xml
laptop-w-dtd.xml:13: element specification: validity error : Element specification content does not follow the DTD, expecting (type , model , screenSizeInch , hasBluetooth), got (type model screenSizeInch )
</specification>
^
To validate against a DTD in a separate file, we can use the --dtdvalid option followed by the file path to the DTD. For instance, we can validate laptop.xml against laptop.dtd using xmllint:
$ xmllint --noout --dtdvalid ./laptop.dtd laptop.xml
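In a script, the exit status is usually what we act on. The following sketch recreates the DTD above in a temporary directory and validates one conforming and one non-conforming document against it; valid.xml and invalid.xml are hypothetical file names:

```shell
#!/bin/sh
# Validate two documents against an external DTD and record the exit statuses.
workdir=$(mktemp -d)

cat > "$workdir/laptop.dtd" <<'EOF'
<!ELEMENT specification (type,model,screenSizeInch)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT model (#PCDATA)>
<!ELEMENT screenSizeInch (#PCDATA)>
EOF

# Conforms to the DTD.
printf '<specification><type>laptop</type><model>macbook</model><screenSizeInch>15</screenSizeInch></specification>\n' > "$workdir/valid.xml"
# Violates the DTD: the model element is missing.
printf '<specification><type>laptop</type><screenSizeInch>15</screenSizeInch></specification>\n' > "$workdir/invalid.xml"

valid_status=0
xmllint --noout --dtdvalid "$workdir/laptop.dtd" "$workdir/valid.xml" 2>/dev/null || valid_status=$?

invalid_status=0
xmllint --noout --dtdvalid "$workdir/laptop.dtd" "$workdir/invalid.xml" 2>/dev/null || invalid_status=$?

echo "valid.xml exit status: $valid_status"
echo "invalid.xml exit status: $invalid_status"
rm -rf "$workdir"
```

A zero status means the document is valid against the DTD, while any non-zero status signals a problem.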
4.2. Validating Against XML Schema Definition (XSD)
Similar to DTD, XML Schema Definition (XSD) defines the structure of an XML document. However, XSD also allows us to define the data type of different child nodes and even restrict the length of data in nodes, which is not possible with DTD. For our laptop.xml, one example of XSD could be like laptop.xsd:
$ cat laptop.xsd
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://www.w3schools.com"
xmlns="https://www.w3schools.com"
elementFormDefault="qualified">
<xs:element name="specification">
<xs:complexType>
<xs:sequence>
<xs:element name="type" type="xs:string"/>
<xs:element name="model" type="xs:string"/>
<xs:element name="screenSizeInch" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
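To validate an XML document against a given schema, we run xmllint with the --schema option followed by the schema file path. Since laptop.xsd declares a targetNamespace, the elements of the document being validated must be in that namespace, for instance by declaring xmlns="https://www.w3schools.com" on the root element. For example: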
$ xmllint --noout --schema laptop.xsd laptop.xml
laptop.xml validates
If the validation fails, we’ll see the error message output along with the exit code set to 3, just like DTD validation with --dtdvalid:
$ xmllint --noout --schema laptop.xsd laptop-invalid.xml
laptop-invalid.xml:6: element model: Schemas validity error : Element '{https://www.w3schools.com}model': This element is not expected. Expected is ( {https://www.w3schools.com}type ).
laptop-invalid.xml fails to validate
In the example above, the laptop-invalid.xml is missing the type node, violating the schema laptop.xsd. Hence, the command results in the error we’ve observed in the standard error.
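To illustrate the data-type checks that XSD adds over DTD, here is a sketch with a hypothetical schema, laptop-typed.xsd, that declares no target namespace and restricts screenSizeInch to xs:positiveInteger. These file names and the schema itself are assumptions for the example, not part of the files used above:

```shell
#!/bin/sh
# Validate documents against a schema that constrains an element's data type.
workdir=$(mktemp -d)

cat > "$workdir/laptop-typed.xsd" <<'EOF'
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="specification">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="type" type="xs:string"/>
        <xs:element name="model" type="xs:string"/>
        <xs:element name="screenSizeInch" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
EOF

# screenSizeInch holds a positive integer, as the schema requires.
printf '<specification><type>laptop</type><model>macbook</model><screenSizeInch>15</screenSizeInch></specification>\n' > "$workdir/valid.xml"
# screenSizeInch holds text that is not a positive integer.
printf '<specification><type>laptop</type><model>macbook</model><screenSizeInch>fifteen</screenSizeInch></specification>\n' > "$workdir/invalid.xml"

valid_status=0
xmllint --noout --schema "$workdir/laptop-typed.xsd" "$workdir/valid.xml" 2>/dev/null || valid_status=$?

invalid_status=0
xmllint --noout --schema "$workdir/laptop-typed.xsd" "$workdir/invalid.xml" 2>/dev/null || invalid_status=$?

echo "valid.xml exit status: $valid_status"
echo "invalid.xml exit status: $invalid_status"
rm -rf "$workdir"
```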
5. Querying XML with XPath
Given the structured nature of XML documents, it’s easy to query and select different nodes in a well-formed XML document. In particular, XPath is the de facto query language for selecting nodes, attributes, or text in an XML document.
To apply an XPath expression to an XML document, we can run the xmllint command with the --xpath option. For example, we could extract the screenSizeInch node of the laptop.xml document using the XPath expression //screenSizeInch:
$ xmllint --xpath '//screenSizeInch' laptop.xml
<screenSizeInch>15</screenSizeInch>
Similarly, to extract the text out of the screenSizeInch node, we can use the XPath node test text():
$ xmllint --xpath '//screenSizeInch/text()' laptop.xml
15
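XPath expressions can also return strings and numbers instead of nodes, which is convenient when capturing values in shell variables. The sketch below uses the standard XPath functions string() and count() on a throwaway copy of the document:

```shell
#!/bin/sh
# Capture XPath results in shell variables.
workdir=$(mktemp -d)
printf '<specification><type>laptop</type><model>macbook</model><screenSizeInch>15</screenSizeInch></specification>\n' > "$workdir/laptop.xml"

# string() returns the text value of the first matching node.
model=$(xmllint --xpath 'string(//model)' "$workdir/laptop.xml") || model=""
# count() returns how many nodes the expression selects.
child_count=$(xmllint --xpath 'count(/specification/*)' "$workdir/laptop.xml") || child_count=""

echo "model=$model children=$child_count"
rm -rf "$workdir"
```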
6. Profiling
The xmllint command-line tool also provides basic timing and profiling functionality through the --timing and --repeat options.
To get a time profile of how long the command execution takes, we can run xmllint with the --timing option:
$ xmllint --noout --timing laptop.xml
Parsing took 0 ms
Freeing took 0 ms
As we can see, the smallest resolution of the timing profile is in the unit of milliseconds. Since our laptop.xml document is small in size, it is not surprising that it could take less than one millisecond to parse.
However, claiming that the parsing time of the document is 0 milliseconds is definitely misleading. One way to work around this limitation is to repeat the parsing several times and then take an average.
The xmllint command provides the --repeat option, which causes the command to repeat the operation 100 times. In our case, that’s exactly what we want:
$ xmllint --noout --timing --repeat laptop.xml
100 iterations took 1 ms
The result shows that parsing laptop.xml 100 times takes one millisecond. From there, we can simply divide by 100 to learn that each iteration took roughly 10 microseconds.
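The division can be automated with a small filter. The function below, per_iteration_us, is a hypothetical helper that assumes the summary line has exactly the format shown above and converts it to microseconds per iteration:

```shell
#!/bin/sh
# Convert a line such as "100 iterations took 1 ms" into microseconds per iteration.
per_iteration_us() {
    # Field 1 is the iteration count and field 4 is the total in milliseconds.
    awk '/iterations took/ { printf "%.1f\n", ($4 * 1000) / $1 }'
}

# Feed the sample line from the run above through the filter.
sample_result=$(printf '100 iterations took 1 ms\n' | per_iteration_us)
echo "roughly $sample_result microseconds per iteration"
```

With a real document, the same function could be fed directly from xmllint, for instance with xmllint --noout --timing --repeat laptop.xml 2>&1 | per_iteration_us, assuming the timing line is written to standard error in the format shown above.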
7. xmllint Interactive Mode
xmllint comes with a shell mode. It allows us to use several commands to navigate and explore a given XML document. This is especially helpful when we are exploring a large XML document.
Furthermore, when we move on to the command demonstration, we’ll see that the commands are very similar to what a Linux user uses daily to navigate the file system. For example, to see the content of nodes, we use cat. To jump into different nodes, we use cd, and so on.
To start a shell session on an XML document, we run xmllint --shell followed by the XML document name:
$ xmllint --shell laptop.xml
/ >
After running the command, we are dropped into the shell prompt, and we start at the root node of the XML document as indicated by the forward slash.
7.1. Navigating the XML
Similar to Linux’s cd, the cd command in xmllint interactive mode allows us to “go” into different nodes. For example, from the root node, we can cd into specification node:
/ > cd specification
specification >
Notice that once we’ve gone into the different nodes, the command prompt will update the value of the current node accordingly. In our terminal, we see that our current node indicator changed to specification.
Next, we can quickly check the current XML node path we are on using pwd. For example, running pwd while we are in the type node of laptop.xml would tell us exactly the path to that node:
type > pwd
/specification/type
7.2. Printing XML Nodes
Within the shell, we can use the cat command to see the content of nodes. Additionally, this command takes an optional node name as an argument to display the content. If unspecified, it will display the current node’s content. For example, while still on the root node, running cat simply shows the content of the entire document.
/ > cat
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE specification SYSTEM "laptop.dtd">
<specification>
<type>laptop</type>
<model>macbook</model>
<screenSizeInch>15</screenSizeInch>
</specification>
If we specify the path to model instead, we’ll get the output just for that node:
/ > cat /specification/model
-------
<model>macbook</model>
7.3. Writing Current Nodes to File
While on an XML node, we can use the write command followed by a filename to save the current node we are on to that file. For instance, we can first cd into model node:
/ > cd /specification/model
model >
Then, we can save the model node into laptop-model.xml using the write command:
model > write laptop-model.xml
Inspecting the content of laptop-model.xml will show that it only contains the model node:
$ cat laptop-model.xml
<model>macbook</model>
7.4. Quitting the Shell
Finally, to quit the shell, we can use the exit, quit, or bye command, or press CTRL + C.
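The shell commands can also be supplied on standard input, which makes one-off queries scriptable. This is a minimal sketch on a throwaway document; it assumes xmllint reads shell commands from a pipe the same way it reads them from a terminal, and the captured output will also contain the prompt text:

```shell
#!/bin/sh
# Run xmllint shell commands non-interactively by piping them in.
workdir=$(mktemp -d)
printf '<specification><type>laptop</type><model>macbook</model></specification>\n' > "$workdir/laptop.xml"

# Send a cat command followed by bye to end the session.
shell_output=$(printf 'cat /specification/model\nbye\n' | xmllint --shell "$workdir/laptop.xml") \
    || shell_output=""
printf '%s\n' "$shell_output"

rm -rf "$workdir"
```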
8. Summary
In this tutorial, we started with an introduction to xmllint installation and general syntax. Then, we took a look at some typical use cases for different purposes. For instance, we saw how we can easily parse and format XML documents with different options.
Then, we looked at validation options such as --valid, --dtdvalid, and --schema. Furthermore, we learned that xmllint can evaluate XPath expressions against an XML document.
We then looked at simple profiling techniques we can use, such as --timing and --repeat. Finally, we explored the interactive mode as well as several of its commands.