
XPath

XPath is a simple language for identifying parts of an XML document that are of interest. It is used by XSLT, and also by XLink; it is extended considerably by XQuery.

XLink

XLink is used to create hyperlinks in XML documents.

XPath cannot be used stand-alone: it is always evaluated from within a host language, such as XSLT, Python, PHP, C# or JavaScript.

XPath can be very powerful: for example, to find all span elements whose class attribute is equal to colour and whose parent is a div element with a class attribute equal to sock, one might write,

div[@class="sock"]/span[@class="colour"]
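As a minimal sketch of evaluating this expression from a host language, the following uses Python's standard `xml.etree.ElementTree` module, which implements only a subset of XPath. The fragment is invented for illustration, and the leading `.//` is an assumption added so that the search covers the whole fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, made up for this example.
DOC = """<body>
  <div class="sock"><span class="colour">red</span><span class="size">M</span></div>
  <div class="shoe"><span class="colour">black</span></div>
  <div class="sock"><span class="colour">blue</span></div>
</body>"""

root = ET.fromstring(DOC)

# span children with class "colour" of div elements with class "sock"
spans = root.findall(".//div[@class='sock']/span[@class='colour']")
colours = [span.text for span in spans]
print(colours)
```

Only the spans inside the two "sock" div elements are returned; the span inside the "shoe" div and the "size" span are filtered out by the predicates.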

XPath uses a path notation (as in URLs) for navigating through the hierarchical structure of an XML document. It uses a non-XML syntax so that it can be used in URIs and XML attribute values.

XPath 3.1 became a Recommendation on 21 March 2017.

Document tree

XPath treats an XML document as a tree of nodes. The topmost node of the tree is the root node, which represents the whole document; its single element child is the root element.

Node types

In XPath, nodes can include more than just elements. There are seven types of nodes.

  1. root node
  2. elements
  3. attributes
  4. text nodes
  5. namespaces
  6. processing instructions
  7. comments
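The node types in this list can be observed with a short sketch. It uses Python's `xml.dom.minidom`, which follows the DOM model rather than the XPath data model, so it is only an approximation: the DOM has no separate namespace nodes, and the tiny document and the `walk` helper are made up for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical document containing several kinds of node.
DOC = (
    '<?xml-stylesheet href="style.css"?>'
    "<!-- catalogue -->"
    '<library><book id="b1">Life is Elsewhere</book></library>'
)

KIND = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.COMMENT_NODE: "comment",
}


def walk(node, found):
    """Collect (kind, name) pairs in document order."""
    found.append((KIND[node.nodeType], node.nodeName))
    if node.nodeType == Node.ELEMENT_NODE:
        # Attributes belong to their element; they are not child nodes.
        for i in range(node.attributes.length):
            found.append(("attribute", node.attributes.item(i).name))
    for child in node.childNodes:
        walk(child, found)
    return found


nodes = walk(minidom.parseString(DOC), [])
for kind, name in nodes:
    print(f"{kind}: {name}")
```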

XPath expression syntax

XPath expressions are written as text and describe the path to follow through the document. To use XPath, the document must be well-formed. There are two ways to write XPath expressions: the abbreviated syntax, which is shorter and easier to read, and the unabbreviated (full) syntax, which is more verbose but offers more options. We will learn the first one.

When an XPath expression is evaluated, the processor looks for the nodes in the document that match the path it describes, and the result is the set of all matching nodes. An XPath expression can be divided into search steps, each of which has three parts:

  1. The axis, which selects element or attribute nodes by name.
  2. The predicate, which restricts the axis selection to the nodes that meet certain conditions.
  3. The node selection, which chooses which part of the selected nodes to keep, such as their child elements or their text.
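To make the three parts concrete, here is a sketch that assembles one expression from an axis, a predicate and a node selection and evaluates it with Python's `xml.etree.ElementTree`. The variable names are illustrative, the data is this section's library sample, and the path starts with `.` because ElementTree evaluates paths relative to an element and supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>"""

root = ET.fromstring(XML)

axis = ".//book"                              # which nodes to look at
predicate = "[author='Mario Vargas Llosa']"   # condition those nodes must meet
selection = "/title"                          # what to keep from the matches

expression = axis + predicate + selection
titles = [title.text for title in root.findall(expression)]
print(expression)
print(titles)
```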

XML sample

<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>

[Figure: node tree representation of the above XML document]

Axis

The axis allows us to select a subset of nodes. Element nodes are indicated by the element name. Attribute nodes are indicated by @ and the name.

/: at the beginning of the expression, indicates the root node, otherwise indicates "child". It must be followed by the name of an element or attribute.

/library/book/author

<author>Milan Kundera</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>

/author

Returns nothing because <author> is not a child of the root node.

/library/author

Returns nothing because <author> is not a child of <library>.

/library/book/author/@birthDate

 birthDate="03/28/1936"
 birthDate="03/28/1936"

/library/book/@birthDate

Returns nothing because birthDate is an attribute of <author>, not of <book>.
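The child steps above can be reproduced with a sketch in Python's `xml.etree.ElementTree`. Two assumptions apply: ElementTree starts from the root element rather than the root node, so the paths are written relative to the library element, and it cannot return attribute nodes, so attribute values are read from the matching elements instead.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>"""

root = ET.fromstring(XML)  # root is the top-level element of the sample

# Equivalent of the three-step child path: book children, then author children.
authors = [a.text for a in root.findall("book/author")]

# No author element is a direct child of the top-level element.
direct_authors = root.findall("author")

# Attribute step: keep authors that have birthDate, then read its value.
birth_dates = [a.get("birthDate") for a in root.findall("book/author[@birthDate]")]

# book elements carry no birthDate attribute, so nothing matches.
books_with_birth_date = root.findall("book[@birthDate]")

print(authors, direct_authors, birth_dates, books_with_birth_date)
```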

//: indicates "descendant" (children, children of children, etc.).

/library//author

<author>Milan Kundera</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>

//author

<author>Milan Kundera</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
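A descendant search can be sketched with Python's `xml.etree.ElementTree`, where `.//` plays the role of `//` relative to the starting element. The second part is a workaround rather than XPath: since ElementTree does not return attribute nodes, the attribute values of all descendants are collected by walking the tree, here for an attribute named year.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>"""

root = ET.fromstring(XML)

# All author elements at any depth below the starting element.
authors = [a.text for a in root.findall(".//author")]

# Values of every "year" attribute found anywhere in the tree,
# gathered in document order with a plain traversal.
years = [e.get("year") for e in root.iter() if "year" in e.attrib]

print(authors)
print(years)
```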

//author//book

Returns nothing because <book> is not a descendant of <author>.

//@year

 year="1973"
 year="1973"
 year="1969"

  • /..: indicates the parent element.

/library/book/author/@birthDate/..

<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>

//@birthDate/../..

<book>
  <title>Pantaleon and the Visitors</title>
  <author birthDate="03/28/1936">Mario Vargas Llosa</author>
  <publicationDate year="1973"/>
</book>

<book>
  <title>Conversation in the Cathedral</title>
  <author birthDate="03/28/1936">Mario Vargas Llosa</author>
  <publicationDate year="1969"/>
</book>
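The parent step can also be tried in Python's `xml.etree.ElementTree`, which supports `..` on elements. Because attributes are not nodes there, this sketch assumes an equivalent formulation: first select the author elements that have a birthDate attribute (the parent of the attribute), then step up once more to reach the enclosing book.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>"""

root = ET.fromstring(XML)

# Elements that own a birthDate attribute (one level above the attribute).
owners = [e.tag for e in root.findall(".//*[@birthDate]")]

# One more level up: the parents of those elements.
books = root.findall(".//*[@birthDate]/..")
titles = [book.findtext("title") for book in books]

print(owners)
print(titles)
```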

  • |: combines several paths; the result contains the nodes selected by any of them.

//author|//title

<title>Life is Elsewhere</title>
<author>Milan Kundera</author>
<title>Pantaleon and the Visitors</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<title>Conversation in the Cathedral</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>

//author|//title|//@year

<title>Life is Elsewhere</title>
<author>Milan Kundera</author>
 year="1973"
<title>Pantaleon and the Visitors</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
 year="1973"
<title>Conversation in the Cathedral</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
 year="1969"
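Python's `xml.etree.ElementTree` does not implement the `|` operator, so the following sketch only emulates a union of two element paths: it walks the tree once and keeps every element whose name is in a set. The `wanted` set is an illustrative name. The traversal yields nodes in document order, which is why titles and authors alternate in the result.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>"""

root = ET.fromstring(XML)

# Emulated union: any element named "author" or "title", in document order.
wanted = {"author", "title"}
labels = [f"{e.tag}: {e.text}" for e in root.iter() if e.tag in wanted]

for label in labels:
    print(label)
```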

Predicate

The predicate is written in square brackets, following the axis.

If the axis has selected some nodes, the predicate allows you to restrict that selection to those that meet certain conditions.

  • [@attribute]: select the elements that have the attribute.
  • [number]: if there are several results, select one of them by its position, counting from 1; last() selects the last one.
  • [condition]: select the nodes that meet the condition.
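The first two predicate forms can be sketched with Python's `xml.etree.ElementTree`, which supports `[@attribute]`, `[number]` and `[last()]` in its XPath subset. The paths are relative to the root element of this section's library sample.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>"""

root = ET.fromstring(XML)

# [@attribute]: only the authors that have a birthDate attribute.
with_birth_date = [a.text for a in root.findall(".//author[@birthDate]")]

# [number]: positions start at 1, so book[1] is the first book.
first_title = root.findtext("book[1]/title")

# [last()]: the last book among its siblings.
last_title = root.findtext("book[last()]/title")

print(with_birth_date, first_title, last_title)
```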

Conditions

Predicates allow you to define conditions on the values of attributes and elements. The following operators can be used in the conditions:

  • logical operators: and, or, not()
  • arithmetic operators: +, -, *, div, mod
  • comparison operators: =, !=, <, >, <=, >=

//publicationDate[@year>1970]

//book[author="Mario Vargas Llosa"]

//@year[.>1970]

//author[.="Mario Vargas Llosa"]

//book[author="Mario Vargas Llosa" and publicationDate/@year="1973"]
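Some of these conditions can be checked with Python's `xml.etree.ElementTree`, with caveats: its XPath subset supports equality tests on child text and on `.` (the latter from Python 3.7), but not numeric comparisons or `and`. In this sketch those two cases are therefore emulated with ordinary Python filtering, which is an assumption of the example and not XPath itself.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>"""

root = ET.fromstring(XML)

# Books whose author child has the given text.
llosa_books = root.findall(".//book[author='Mario Vargas Llosa']")

# Authors whose own text equals the given value.
llosa_authors = root.findall(".//author[.='Mario Vargas Llosa']")

# Numeric comparison emulated in Python: publication years after 1970.
after_1970 = [
    p.get("year") for p in root.iter("publicationDate") if int(p.get("year")) > 1970
]

# "and" emulated in Python: Vargas Llosa books published in 1973.
llosa_1973 = [
    b.findtext("title")
    for b in llosa_books
    if b.find("publicationDate").get("year") == "1973"
]

print(len(llosa_books), len(llosa_authors), after_1970, llosa_1973)
```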

Node selection

The node selection is written after the axis and the predicate. If the axis and the predicate have selected some nodes, the selection of nodes indicates what part of those nodes we keep.

  • /node(): Select all children (elements or text) of the node.
  • //node(): Select all descendants (elements or text) of the node.
  • /text(): Select only the text contained in the node.
  • //text(): Select only the text contained in the node and all its descendants.
  • /*: Select all children (elements only) of the node.
  • //*: Select all descendants (elements only) of the node.
  • /@*: Select all attributes of the node.
  • //@*: Select all attributes of the node's descendants.
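Python's `xml.etree.ElementTree` has no `text()`, `node()` or `@*` steps, but rough counterparts of several selections in this list exist as element properties. The sketch below shows those counterparts on the second book of the library sample. The mapping in the comments is approximate, and whitespace-only text is dropped by hand, which XPath itself would not do.

```python
import xml.etree.ElementTree as ET

XML = """<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
</library>"""

root = ET.fromstring(XML)
book = root.find("book[2]")
author = book.find("author")

# Roughly /*: the child elements of the node.
children = [child.tag for child in book]

# Roughly /text(): the text directly contained in the node.
author_text = author.text

# Roughly /@*: all attributes of the node, as a name-to-value mapping.
author_attributes = dict(author.attrib)

# Roughly //text(): text of the node and its descendants,
# here with whitespace-only pieces removed.
all_text = [t.strip() for t in book.itertext() if t.strip()]

print(children, author_text, author_attributes, all_text)
```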

[Figure: a node tree sample]

Tools

To evaluate XPath expressions, we can use xmllint --xpath expr file.xml or the Visual Studio Code extension XSLT/XPath for Visual Studio Code.

Bibliography, webography and credits