1. Overview
When we work with XML, we can use XPath to navigate through elements and attributes in the XML document.
In this tutorial, we’ll discuss how to evaluate XPath expressions under the Linux command line.
2. Our XML Example and XPath Expressions
First of all, let’s create an XML document, books.xml, as the input XML file that we’ll use throughout this tutorial:
<books>
<book id="1" category="linux">
<title lang="en">Linux Device Drivers</title>
<year>2003</year>
<author>Jonathan Corbet</author>
<author>Alessandro Rubini</author>
</book>
<book id="2" category="linux">
<title lang="en">Understanding the Linux Kernel</title>
<year>2005</year>
<author>Daniel P. Bovet</author>
<author>Marco Cesati</author>
</book>
<book id="3" category="novel">
<title lang="en">A Game of Thrones</title>
<year>2013</year>
<author>George R. R. Martin</author>
</book>
<book id="4" category="novel">
<title lang="fr">The Little Prince</title>
<year>1990</year>
<author>Antoine de Saint-Exupéry</author>
</book>
</books>
In our books.xml file, we have four books. Later, we’ll address how to evaluate a couple of XPath expressions under the Linux command line:
- //title[@lang='fr'] – this XPath expression selects all the title elements of books written in French (the “title” element has a “lang” attribute with a value of “fr”)
- //book[year>2004]/title – this XPath expression selects the title elements of all books published after 2004 (the “year” element has a value greater than 2004)
In this tutorial, we’re going to discuss three different approaches to work with XPath under the command line:
- Using the xmllint command
- Using the XMLStarlet toolkit
- Using the xidel utility
3. Using the xmllint Command
The xmllint command is installed with the libxml2 package. Usually, we can use this command to validate XML files, parse XML files, or pretty-print an XML file.
The xmllint command supports a “--xpath” option to evaluate XPath expressions:
xmllint --xpath "XPATH_EXPRESSION" INPUT.xml
It’s worth mentioning that, since libxml2 only implements XPath 1.0, the xmllint command supports only XPath 1.0.
Let’s test with our XPath expressions to see if we can get the expected result.
First, let’s select all title elements of French books in our books.xml:
$ xmllint --xpath "//title[@lang='fr']" books.xml
<title lang="fr">The Little Prince</title>
We got the title element of the book “The Little Prince” in the output. This is correct since it’s the only title element with the lang="fr" attribute.
Second, let’s test the other XPath expression:
$ xmllint --xpath "//book[year>2004]/title" books.xml
<title lang="en">Understanding the Linux Kernel</title>
<title lang="en">A Game of Thrones</title>
This time, xmllint prints two title elements. Our second XPath expression is also correctly evaluated by the xmllint command.
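Both examples above print whole elements. When only a text value or a count is needed, the expression can be wrapped in the XPath 1.0 functions string() and count(). The following is a minimal sketch; the helper names xpath_string and xpath_count are hypothetical and only serve the illustration:

```shell
# Hypothetical helper: print the string value of the first node
# matched by an XPath 1.0 expression.
# $1 = XPath expression, $2 = XML file
xpath_string() {
    xmllint --xpath "string($1)" "$2"
}

# Hypothetical helper: print how many nodes an XPath 1.0 expression matches.
# $1 = XPath expression, $2 = XML file
xpath_count() {
    xmllint --xpath "count($1)" "$2"
}
```

With our books.xml, xpath_string "//title[@lang='fr']" books.xml prints The Little Prince, and xpath_count "//book[year>2004]" books.xml prints 2. Note that string() only returns the value of the first matching node.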
4. Using the XMLStarlet Toolkit
XMLStarlet is a powerful command-line XML toolkit based on libxml2. Therefore, similar to the xmllint command, XMLStarlet only supports XPath 1.0.
XMLStarlet ships with one executable called xml, which we can use as the short form of the xmlstarlet command.
4.1. XMLStarlet Syntax
The syntax of the xml command is:
xml [options] <command> [command options]
XMLStarlet defines a set of commands to perform different XML operations — for example, ed (edit) to edit or update an XML document, tr (transform) to transform an XML document using XSLT, and so on.
To select data or query XML documents using XPath, we can use the sel (select) command. In fact, the sel command can do much more than XPath expression evaluation.
Basically, the sel command allows us to avoid writing an XSLT stylesheet to perform some XML document queries. It can generate XSLT for us from the combination of command-line options.
That is to say, when we use the sel command, XMLStarlet will convert all our command arguments into XSLT to do the query on the input XML documents.
Let’s have a look at the general syntax of the sel command:
xml sel -t <template options> Input.xml
The template is a fundamental concept of XSLT. Using the sel command, we create a template using the -t option.
In this tutorial, we won’t dive into XSLT transformation. Our goal is to evaluate XPath expressions.
The sel command supports many template options. We’ll introduce two of them: -c and -v because these two template options are pretty commonly used for XPath evaluation.
For example, let’s say the evaluation result of an XPath expression is a single element whose content is the string “text”:
- The -c “XPath_Expression” option will apply “xsl:copy-of” – this creates a copy of the found nodes, so we’ll get the complete element, tags included
- The -v “XPath_Expression” option will apply “xsl:value-of” – this extracts the value of the XML element in the result, so we’ll have only: text
4.2. Evaluating XPath Expressions Using the xml sel Command
Now, let’s give it a try with our two XPath expressions.
First, we’ll test our first XPath expression using the xml sel command with the -c template option:
$ xml sel -t -c "//title[@lang='fr']" books.xml
<title lang="fr">The Little Prince</title>
As the output shows, our XPath expression has been correctly evaluated, and we’ve got the expected title element.
Next, let’s have a look at what we’ll get if we use the -v template option:
$ xml sel -t -v "//title[@lang='fr']" books.xml
The Little Prince
This time, we got the text of the title element without XML tags.
Now, let’s test the command with our other XPath expression:
$ xml sel -t -c "/books/book[year>2004]/title" books.xml
<title lang="en">Understanding the Linux Kernel</title><title lang="en">A Game of Thrones</title>
When we use the -c option, the output contains the two expected title elements.
However, the output is not “pretty-printed.” The line breaks between the XML elements are missing.
This happens because the XPath expression selects only the title elements. The line breaks in the source file are whitespace outside those elements, so the xsl:copy-of instruction doesn’t copy them, and the copied elements are written back to back.
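If we’d rather have each copied element on its own line, one option is to iterate over the matches with the -m (match) template option, copy the current node with -c ".", and print a newline with -n. The sketch below wraps this in a hypothetical helper named copy_nodes_per_line; it also assumes the executable may be installed as either xmlstarlet or xml:

```shell
# Hypothetical helper: copy every node matched by an XPath expression,
# one node per line.
# $1 = XPath expression, $2 = XML file
copy_nodes_per_line() {
    # Depending on the system, the executable is called "xmlstarlet" or "xml"
    if command -v xmlstarlet >/dev/null 2>&1; then cmd=xmlstarlet; else cmd=xml; fi
    # -m iterates over the matches, -c "." copies the current node,
    # -n prints a newline after each copy
    "$cmd" sel -t -m "$1" -c "." -n "$2"
}
```

For example, copy_nodes_per_line "/books/book[year>2004]/title" books.xml prints the two title elements on separate lines.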
Next, let’s see what we’ll get if we use the -v option:
$ xml sel -t -v "/books/book[year>2004]/title" books.xml
Understanding the Linux Kernel
A Game of Thrones
As the output shows, when we use the -v option, we’ll get the text of the matching elements, with each value on a separate line.
This time, the values are separated by line breaks. That’s because when the result has multiple elements, the xsl:value-of will sit in an xsl:for-each element, something like:
<xsl:for-each select="/books/book[year>2004]/title">
<xsl:value-of select="."/>
</xsl:for-each>
Thus, the text of each matching element will be printed to a separate line.
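We don’t have to guess what XSLT the sel command builds: its -C (--comp) option prints the generated stylesheet instead of running it. Here is a small sketch, with show_generated_xslt as a hypothetical helper name. The exact stylesheet depends on the XMLStarlet version, so it may look different from the simplified snippet above:

```shell
# Hypothetical helper: print the XSLT that "sel -t -v" generates
# for a given XPath expression, without processing any input file.
# $1 = XPath expression
show_generated_xslt() {
    # Depending on the system, the executable is called "xmlstarlet" or "xml"
    if command -v xmlstarlet >/dev/null 2>&1; then cmd=xmlstarlet; else cmd=xml; fi
    # -C (--comp) displays the generated stylesheet
    "$cmd" sel -C -t -v "$1"
}
```

Calling show_generated_xslt "/books/book[year>2004]/title" writes the complete stylesheet to standard output, which is handy when a query doesn’t behave as expected.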
5. Using the xidel Command
The xidel command is a nice XML/HTML/JSON data extraction utility and supports XPath 3.0.
Extracting data using the xidel command with an XPath expression is pretty straightforward:
xidel [options] --xpath "XPath Expression" XML_INPUT
We can pass some options to control the output, as we’ll see in later examples.
5.1. Extracting Data With an XPath Expression
Let’s try the xidel command with our first XPath expression:
$ xidel --xpath "//title[@lang='fr']" books.xml
**** Retrieving: books.xml ****
**** Processing: books.xml ****
The Little Prince
As we can see in the output, xidel prints status information by default. Also, it extracts the text out of the found elements automatically.
If we want to skip the status messages, we can add the -s option to let xidel work in “silent” mode.
Moreover, we can ask xidel to print the complete XML elements by passing the --printed-node-format="xml" option. One nice feature of the xidel command is that, when its output is in XML format, it highlights the attributes in the console output.
Next, let’s execute the xidel command with our second XPath expression:
$ xidel -s --printed-node-format="xml" --xpath "/books/book[year>2004]/title" books.xml
<title lang="en">Understanding the Linux Kernel</title>
<title lang="en">A Game of Thrones</title>
As we expected, it prints the two title elements from our sample file.
5.2. Evaluating XPath 3.0 Expressions
Finally, let’s test if the xidel command can work with XPath 3.0 expressions.
Sequences were introduced in XPath 2.0 and carried over into XPath 3.0, so they aren’t available in XPath 1.0. So, we’ll write an XPath expression that uses a sequence to select the book elements whose publishing year is in a given list of values: //book[year=(2004, 2005, 2013, 2020)]
Let’s see if xidel can evaluate this XPath expression and find the books we’re interested in:
$ xidel -s --printed-node-format="xml" --xpath "//book[year=(2004, 2005, 2013, 2020)]" books.xml
<book id="2" category="linux">
<title lang="en">Understanding the Linux Kernel</title>
<year>2005</year>
<author>Daniel P. Bovet</author>
<author>Marco Cesati</author>
</book>
<book id="3" category="novel">
<title lang="en">A Game of Thrones</title>
<year>2013</year>
<author>George R. R. Martin</author>
</book>
Great, it works with the XPath 3.0 expression!
Since xmllint and XMLStarlet only support XPath 1.0, they cannot evaluate this XPath expression:
$ xmllint --xpath "//book[year=(2004, 2005, 2013, 2020)]" books.xml
XPath error : Invalid expression
//book[year=(2004, 2005, 2013, 2020)]
^
XPath evaluation failure
$ xml sel -t -c "//book[year=(2004, 2005, 2013, 2020)]" books.xml
Invalid expression: //book[year=(2004, 2005, 2013, 2020)]
compilation error: element copy-of
xsl:copy-of : could not compile select expression '//book[year=(2004, 2005, 2013, 2020)]'
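Functions that were added after XPath 1.0 are likewise only available in xidel. As a further sketch, the string-join() function can combine all matching values into a single line. The helper name join_values is hypothetical, and the separator is assumed not to contain a single quote:

```shell
# Hypothetical helper: join the text of all matched nodes into one line.
# $1 = XPath expression selecting the nodes
# $2 = separator (must not contain a single quote)
# $3 = XML file
join_values() {
    # string-join() is not part of XPath 1.0, so this needs xidel
    xidel -s --xpath "string-join($1, '$2')" "$3"
}
```

With our sample file, join_values "//book[year>2004]/title" ", " books.xml prints: Understanding the Linux Kernel, A Game of Thrones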
6. Conclusion
In this article, we’ve introduced how to evaluate XPath expressions under the Linux command line.
Through examples, we’ve covered three utilities for the job: xmllint and XMLStarlet, which support only XPath 1.0, and xidel, which also supports XPath 3.0.