1. Overview

In this article, we’ll go over the basics of XPath using the support built into the standard Java JDK.

We’ll take a simple XML document, parse it, and see how to navigate it to extract the information we need.

XPath is a standard syntax recommended by the W3C; it is a set of expressions for navigating XML documents. You can find a full XPath reference here.

2. A Simple XPath Parser

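import java.io.File;
import java.io.FileInputStream;

import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DefaultParser {

    // The XML file that the XPath expressions are evaluated against
    private final File file;

    public DefaultParser(File file) {
        this.file = file;
    }

    // Accessor used by the snippets in this article (this.getFile())
    public File getFile() {
        return file;
    }
}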

Now let’s take a closer look at the elements you will find in the DefaultParser:

FileInputStream fileIS = new FileInputStream(this.getFile());
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

Let’s break that down:

DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();

We will use this factory to obtain a parser that produces a DOM object tree from our XML document:

DocumentBuilder builder = builderFactory.newDocumentBuilder();

With an instance of this class, we can parse XML documents from many different input sources, such as an InputStream, a File, a URI string or a SAX InputSource:

Document xmlDocument = builder.parse(fileIS);

A Document (org.w3c.dom.Document) represents the entire XML document; it is the root of the document tree and provides our first access to the data:

XPath xPath = XPathFactory.newInstance().newXPath();

Through the XPath object, we compile expressions and evaluate them against our document to extract what we need from it:

xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

We can compile an XPath expression passed as a string and define what kind of data we expect to receive, such as a NODESET, a NODE or a STRING.
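As a rough illustration of how the requested return type changes the result, the sketch below evaluates expressions against a small inline document with NODESET, NODE and STRING. The class name ReturnTypesSketch, the inline XML string and the helper methods are assumptions made for this example; they are not part of the article's DefaultParser.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; not part of the article's DefaultParser.
class ReturnTypesSketch {

    // Assumed inline document so the sketch runs without a file on disk.
    static final String XML =
        "<Tutorials>"
        + "<Tutorial tutId=\"01\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // NODESET: all matching nodes, returned as a NodeList.
    static int countTutorials(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.compile("/Tutorials/Tutorial")
            .evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    // NODE: a single node; with the JDK implementation, the first match in document order.
    static String firstTutorialId(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        Element first = (Element) xPath.compile("/Tutorials/Tutorial")
            .evaluate(doc, XPathConstants.NODE);
        return first.getAttribute("tutId");
    }

    // STRING: the text value of the first matching node.
    static String firstTitle(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return (String) xPath.compile("/Tutorials/Tutorial/title")
            .evaluate(doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(countTutorials(doc));
        System.out.println(firstTutorialId(doc));
        System.out.println(firstTitle(doc));
    }
}
```

The same expression ("/Tutorials/Tutorial") is used for both NODESET and NODE on purpose, to show that only the requested return type differs.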

3. Let’s Start

Now that we’ve looked at the base components we will use, let’s start with some code, using a simple XML document for testing purposes:

<?xml version="1.0"?>
<Tutorials>
    <Tutorial tutId="01" type="java">
        <title>Guava</title>
        <description>Introduction to Guava</description>
        <date>04/04/2016</date>
        <author>GuavaAuthor</author>
    </Tutorial>
    <Tutorial tutId="02" type="java">
        <title>XML</title>
        <description>Introduction to XPath</description>
        <date>04/05/2016</date>
        <author>XMLAuthor</author>
    </Tutorial>
</Tutorials>

3.1. Retrieve a Basic List of Elements

The first method is a simple use of an XPath expression to retrieve a list of nodes from the XML:

FileInputStream fileIS = new FileInputStream(this.getFile());
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

We can retrieve the tutorial list contained in the root node by using the expression above, or by using the expression “//Tutorial”. The latter, however, retrieves every <Tutorial> node in the document no matter where it is located, that is, at any level of the tree.

The NodeList returned when we pass NODESET as the return type to evaluate() is an ordered collection of nodes that can be accessed by index.
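To make the difference between the two expressions and the index-based access more tangible, here is a minimal sketch. It assumes a hypothetical inline document in which one <Tutorial> is nested inside an invented <Archive> element; the class name NodeListSketch and the helper tutorialIds are also made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name used only for this sketch.
class NodeListSketch {

    // Assumed document: the <Archive> wrapper is invented to show a nested match.
    static final String XML =
        "<Tutorials>"
        + "<Tutorial tutId=\"01\"/>"
        + "<Archive><Tutorial tutId=\"99\"/></Archive>"
        + "</Tutorials>";

    // Collects the tutId attribute of every element matched by the expression.
    static List<String> tutorialIds(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodeList = (NodeList) xPath.compile(expression)
            .evaluate(doc, XPathConstants.NODESET);

        List<String> ids = new ArrayList<>();
        // A NodeList is not Iterable, so we walk it by index.
        for (int i = 0; i < nodeList.getLength(); i++) {
            Element tutorial = (Element) nodeList.item(i);
            ids.add(tutorial.getAttribute("tutId"));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Only the <Tutorial> directly under the root element
        System.out.println(tutorialIds("/Tutorials/Tutorial"));
        // Every <Tutorial> at any depth
        System.out.println(tutorialIds("//Tutorial"));
    }
}
```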

3.2. Retrieving a Specific Node by Its ID

We can look for an element based on any given id just by filtering:

DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(this.getFile());
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/Tutorials/Tutorial[@tutId='" + id + "']";
Node node = (Node) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODE);

By using this kind of expression, we can filter for whatever element we need just by using the correct syntax. The bracketed filters in these expressions are called predicates, and they are an easy way to locate specific data in a document, for example:

/Tutorials/Tutorial[1]

/Tutorials/Tutorial[last()]

/Tutorials/Tutorial[position()<4]

You can find a complete reference of predicates here.
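The following sketch shows how a few predicates behave against an inline copy of the two sample tutorials. The class name PredicateSketch and the helper titles are assumptions made for this example, and the attribute predicate uses a hard-coded id rather than a method parameter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name used only for this sketch.
class PredicateSketch {

    // Trimmed-down inline copy of the article's sample document.
    static final String XML =
        "<Tutorials>"
        + "<Tutorial tutId=\"01\" type=\"java\"><title>Guava</title></Tutorial>"
        + "<Tutorial tutId=\"02\" type=\"java\"><title>XML</title></Tutorial>"
        + "</Tutorials>";

    // Returns the <title> text of every <Tutorial> selected by the expression.
    static List<String> titles(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList matches = (NodeList) xPath.compile(expression)
            .evaluate(doc, XPathConstants.NODESET);

        List<String> titles = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            Element tutorial = (Element) matches.item(i);
            titles.add(tutorial.getElementsByTagName("title").item(0).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titles("/Tutorials/Tutorial[1]"));            // positional: first
        System.out.println(titles("/Tutorials/Tutorial[last()]"));       // positional: last
        System.out.println(titles("/Tutorials/Tutorial[position()<4]")); // first three at most
        System.out.println(titles("/Tutorials/Tutorial[@tutId='02']"));  // attribute filter
    }
}
```

Note that XPath positions start at 1, not 0, which is easy to trip over when mixing them with NodeList indexes.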

3.3. Retrieving Nodes by a Specific Tag Name

Now let’s go further by introducing axes. Let’s see how they work by using one in an XPath expression:

Document xmlDocument = builder.parse(this.getFile());
this.clean(xmlDocument);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//Tutorial[descendant::title[text()=" + "'" + name + "'" + "]]";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

With the expression used above, we are looking for every <Tutorial> element that has a descendant <title> with the text passed as parameter in the “name” variable.

Following the sample XML provided for this article, we could look for a <title> containing the text “Guava” or “XML” and we would retrieve the whole <Tutorial> element with all its data.

Axes provide a very flexible way to navigate an XML document, and you can find the full documentation on the official site (https://www.w3.org/TR/xpath/#axes).

3.4. Manipulating Data in Expressions

XPath also allows us to manipulate data inside expressions if needed:

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//Tutorial[number(translate(date, '/', '')) > " + date + "]";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

In this expression, we pass our method a simple string as a date in the form “ddmmyyyy”, but the XML stores this data in the format “dd/mm/yyyy”. To match a result, we strip the slashes from the stored value with translate() and convert it to a number with number(), two of the core functions provided by XPath (https://www.w3.org/TR/xpath/#corelib).

3.5. Retrieving Elements from a Document With Namespace Defined

If our XML document has a namespace defined, as in the example_namespace.xml used here, the rules for retrieving the data we need change, since our XML starts like this:

<?xml version="1.0"?>
<Tutorials xmlns="/full_archive">

</Tutorials>

Now when we use an expression similar to “//Tutorial”, we are not going to get any result. That XPath expression returns all <Tutorial> elements that aren’t under any namespace, and in our new example_namespace.xml, all <Tutorial> elements are defined in the namespace /full_archive.

Let’s see how to handle namespaces.

First of all, we need to set the namespace context so that XPath knows where to look for our data:

xPath.setNamespaceContext(new NamespaceContext() {
    @Override
    public Iterator getPrefixes(String arg0) {
        return null;
    }

    @Override
    public String getPrefix(String arg0) {
        return null;
    }

    @Override
    public String getNamespaceURI(String arg0) {
        if ("bdn".equals(arg0)) {
            return "/full_archive";
        }
        return null;
    }
});

In the code above, we define “bdn” as the prefix for our namespace “/full_archive”. From now on, we need to add “bdn” to the XPath expressions used to locate elements:

String expression = "/bdn:Tutorials/bdn:Tutorial";
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);

Using the expression above, we are able to retrieve all <Tutorial> elements under the “bdn” namespace.

3.6. Avoiding Empty Text Nodes Troubles

As you may have noticed, the code in section 3.3 calls a new method right after parsing our XML into a Document object: this.clean(xmlDocument);

Sometimes, when we iterate through elements, child nodes and so on, empty text nodes in our document can cause unexpected behavior in the results we get.

For example, if we call node.getFirstChild() while iterating over all <Tutorial> elements to read the <title> information, we get an empty “#text” node instead of the element we are looking for.

To fix the problem, we can navigate through our document and remove those empty nodes, like this:

NodeList childNodes = node.getChildNodes();
for (int n = childNodes.getLength() - 1; n >= 0; n--) {
    Node child = childNodes.item(n);
    short nodeType = child.getNodeType();
    if (nodeType == Node.ELEMENT_NODE) {
        clean(child);
    } else if (nodeType == Node.TEXT_NODE) {
        String trimmedNodeVal = child.getNodeValue().trim();
        if (trimmedNodeVal.length() == 0) {
            node.removeChild(child);
        } else {
            child.setNodeValue(trimmedNodeVal);
        }
    } else if (nodeType == Node.COMMENT_NODE) {
        node.removeChild(child);
    }
}

By doing this, we can check the type of each node we find and remove the ones we don’t need.

4. Conclusions

Here we just introduced the XPath support provided by default, but there are also many popular libraries and related technologies such as JDOM, Saxon, XQuery, JAXP, Jaxen or even Jackson. There are libraries for HTML parsing too, like JSoup.

XPath is not limited to Java either: XPath expressions are also used by the XSLT language to navigate XML documents.

As you can see, there is a wide range of possibilities for handling these kinds of files.

There is great standard support by default for parsing, reading and processing XML/HTML documents. You can find the full working sample here (https://github.com/eugenp/tutorials/tree/master/xml).
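For readers who want to try the namespace handling from section 3.5 in isolation, here is a minimal self-contained sketch. The class name NamespaceSketch, the helper count and the inline XML are assumptions made for this example. It also calls setNamespaceAware(true) on the factory, which the snippets in this article do not show; DocumentBuilderFactory is not namespace-aware by default, so the sketch enables it explicitly.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name used only for this sketch.
class NamespaceSketch {

    // Assumed inline stand-in for example_namespace.xml.
    static final String XML =
        "<Tutorials xmlns=\"/full_archive\">"
        + "<Tutorial tutId=\"01\"/>"
        + "<Tutorial tutId=\"02\"/>"
        + "</Tutorials>";

    // Counts the nodes selected by the expression, with "bdn" bound to /full_archive.
    static int count(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Assumption of this sketch: enable namespace processing explicitly.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "bdn".equals(prefix) ? "/full_archive" : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String namespaceURI) {
                return "/full_archive".equals(namespaceURI) ? "bdn" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                String prefix = getPrefix(namespaceURI);
                return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
            }
        });

        NodeList nodes = (NodeList) xPath.compile(expression)
            .evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count("//Tutorial"));                  // unprefixed name
        System.out.println(count("/bdn:Tutorials/bdn:Tutorial")); // prefixed name
    }
}
```

Unlike the anonymous class in section 3.5, this variant returns non-null values from getPrefix and getPrefixes for the known namespace; that is a design choice of the sketch, not something the XPath evaluation here depends on.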