XPath¶
XPath is a simple language for identifying the parts of an XML document that are of interest. It is used by XSLT and also by XLink (which is used to create hyperlinks in XML documents), and it is extended considerably by XQuery.
XPath cannot be used stand-alone: it is always evaluated in the context of a host language, whether that language is XSLT, Python, PHP, C#, JavaScript, or another.
XPath can be very powerful: for example, to find every span element whose class attribute is equal to colour and whose parent is a div element with a class attribute equal to sock, one might write:
div[@class="sock"]/span[@class="colour"]
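As a rough illustration of XPath running inside a host language, the sketch below evaluates the expression above with Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The HTML-like fragment is invented for this example, and ".//" is prepended because ElementTree paths are relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for this illustration.
DOC = """<body>
  <div class="sock"><span class="colour">red</span><span class="size">M</span></div>
  <div class="shoe"><span class="colour">black</span></div>
</body>"""

root = ET.fromstring(DOC)

# Select span children with class "colour" of div elements with class "sock".
matches = root.findall('.//div[@class="sock"]/span[@class="colour"]')
colours = [span.text for span in matches]
print(colours)
```

Only the span inside the div with class sock is selected; the span with class colour under the other div is ignored.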
XPath uses a path notation (as in URLs) for navigating through the hierarchical structure of an XML document. It uses a non-XML syntax so that it can be used in URIs and XML attribute values.
XPath 3.1 became a Recommendation on 21 March 2017.
Document tree¶
XPath treats XML documents as trees of nodes. The topmost element of the tree is called the root element.
Node types¶
In XPath, nodes can include more than just elements. There are seven types of nodes.
- root node (the top of the tree, which contains the root element)
- elements
- attributes
- text nodes
- namespaces
- processing instructions
- comments
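To make the node types more concrete, the sketch below walks a small document with Python's xml.dom.minidom and records the kind of each node it meets. This is only an approximation: the DOM's node kinds are close to XPath's but not identical (the DOM has no namespace nodes, for instance). The inline document and the node_kinds helper are invented for this illustration.

```python
from xml.dom import Node, minidom

# Hypothetical document containing several kinds of nodes.
doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?note sample?>'
    '<!-- a comment -->'
    '<library><book year="1973">Life is Elsewhere</book></library>'
)

# Map DOM node type constants to the names used in the list above.
NAMES = {
    Node.DOCUMENT_NODE: "root",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}

def node_kinds(node, found=None):
    """Collect the kind of every node in the tree, in document order."""
    found = [] if found is None else found
    found.append(NAMES.get(node.nodeType, "other"))
    # Attributes belong to an element but are not among its children.
    if node.nodeType == Node.ELEMENT_NODE:
        found.extend("attribute" for _ in range(len(node.attributes)))
    for child in node.childNodes:
        node_kinds(child, found)
    return found

kinds = node_kinds(doc)
print(kinds)
```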
XPath expression syntax¶
XPath expressions are written as text and describe the path you want to follow. To use XPath, the document must be well-formed. There are two ways to write XPath expressions: the abbreviated syntax, which is simpler and easier to read, and the unabbreviated (full) syntax, which is more verbose but offers more options. This section covers the abbreviated syntax.
When you evaluate an XPath expression, the system will look for nodes in the document that match the path you described. The result of the evaluation will be all the nodes that match the path. An XPath expression can be divided into search steps, each of which has three parts:
- The axis, which selects element or attribute nodes by name.
- The predicate, which restricts the axis selection to nodes that meet certain conditions.
- The node selection, which chooses the elements or text contained in the selected nodes.
XML sample¶
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book>
    <title>Life is Elsewhere</title>
    <author>Milan Kundera</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Pantaleon and the Visitors</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1973"/>
  </book>
  <book>
    <title>Conversation in the Cathedral</title>
    <author birthDate="03/28/1936">Mario Vargas Llosa</author>
    <publicationDate year="1969"/>
  </book>
</library>
Axis¶
The axis allows us to select a subset of nodes. Element nodes are indicated by the element name. Attribute nodes are indicated by @ and the name.
/: at the beginning of the expression, indicates the root node; elsewhere, it indicates "child". It must be followed by the name of an element or attribute.
/library/book/author
<author>Milan Kundera</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
/author
Returns nothing because <author> is not a child of the root node.
/library/author
Returns nothing because <author> is not a child of <library>.
/library/book/author/@birthDate
birthDate="03/28/1936"
birthDate="03/28/1936"
/library/book/@birthDate
Returns nothing because birthDate is an attribute of <author>, not of <book>.
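The child steps can be tried out with a minimal sketch using Python's xml.etree.ElementTree, assuming the sample document in a compact form. ElementTree implements only part of XPath: paths are relative to the element they are called on (so /library/book/author becomes ./book/author from the root element), and attribute nodes cannot be returned, so the attribute is read from each matching element instead.

```python
import xml.etree.ElementTree as ET

# The sample document, written compactly.
XML = """<library>
<book><title>Life is Elsewhere</title><author>Milan Kundera</author><publicationDate year="1973"/></book>
<book><title>Pantaleon and the Visitors</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1973"/></book>
<book><title>Conversation in the Cathedral</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1969"/></book>
</library>"""

library = ET.fromstring(XML)  # the <library> root element

# Equivalent of /library/book/author
authors = [a.text for a in library.findall("./book/author")]

# Equivalent of /library/author: empty, <author> is not a child of <library>
direct_authors = library.findall("./author")

# Approximation of /library/book/author/@birthDate: select the authors
# that have the attribute, then read its value from each element.
birth_dates = [a.get("birthDate") for a in library.findall("./book/author[@birthDate]")]

print(authors, direct_authors, birth_dates)
```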
//: indicates "descendant" (children, children of children, etc.).
/library//author
<author>Milan Kundera</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
//author
<author>Milan Kundera</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
//author//book
Returns nothing because <book> is not a descendant of <author>.
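The descendant step can be sketched the same way with xml.etree.ElementTree, again assuming the sample document in compact form. ElementTree spells the search from the root element as .// rather than //.

```python
import xml.etree.ElementTree as ET

# The sample document, written compactly.
XML = """<library>
<book><title>Life is Elsewhere</title><author>Milan Kundera</author><publicationDate year="1973"/></book>
<book><title>Pantaleon and the Visitors</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1973"/></book>
<book><title>Conversation in the Cathedral</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1969"/></book>
</library>"""

library = ET.fromstring(XML)

# Equivalent of //author: every <author> at any depth.
all_authors = [a.text for a in library.findall(".//author")]

# Equivalent of //author//book: empty, no <book> lies below an <author>.
books_under_authors = library.findall(".//author//book")

print(all_authors, books_under_authors)
```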
/..: indicates the parent element.
//@year
year="1973"
year="1973"
year="1969"
/library/book/author/@birthDate/..
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
//@birthDate/../..
<book>
  <title>Pantaleon and the Visitors</title>
  <author birthDate="03/28/1936">Mario Vargas Llosa</author>
  <publicationDate year="1973"/>
</book>
<book>
  <title>Conversation in the Cathedral</title>
  <author birthDate="03/28/1936">Mario Vargas Llosa</author>
  <publicationDate year="1969"/>
</book>
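A minimal sketch of the parent step with xml.etree.ElementTree, assuming the sample document in compact form. Because ElementTree cannot start from an attribute node, the sketch selects the author elements that carry a birthDate attribute and then moves up one level, which reaches the same book elements.

```python
import xml.etree.ElementTree as ET

# The sample document, written compactly.
XML = """<library>
<book><title>Life is Elsewhere</title><author>Milan Kundera</author><publicationDate year="1973"/></book>
<book><title>Pantaleon and the Visitors</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1973"/></book>
<book><title>Conversation in the Cathedral</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1969"/></book>
</library>"""

library = ET.fromstring(XML)

# Authors with a birthDate attribute, then ".." to reach their parent <book>.
books = library.findall(".//author[@birthDate]/..")
titles = [b.findtext("title") for b in books]

# All year attribute values, read from the elements that carry them.
years = [pd.get("year") for pd in library.iter("publicationDate")]

print(titles, years)
```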
|: combines several paths; the result contains the nodes selected by any of them.
//author|//title
<title>Life is Elsewhere</title>
<author>Milan Kundera</author>
<title>Pantaleon and the Visitors</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
<title>Conversation in the Cathedral</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
//author|//title|//@year
<title>Life is Elsewhere</title>
<author>Milan Kundera</author>
year="1973"
<title>Pantaleon and the Visitors</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
year="1973"
<title>Conversation in the Cathedral</title>
<author birthDate="03/28/1936">Mario Vargas Llosa</author>
year="1969"
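ElementTree's XPath subset has no | operator, so the sketch below emulates the union of the author and title paths instead of evaluating it directly: it walks the tree in document order and keeps the elements whose tag is in a set. The wanted set is a name invented for this illustration, and the sample document is assumed in compact form.

```python
import xml.etree.ElementTree as ET

# The sample document, written compactly.
XML = """<library>
<book><title>Life is Elsewhere</title><author>Milan Kundera</author><publicationDate year="1973"/></book>
<book><title>Pantaleon and the Visitors</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1973"/></book>
<book><title>Conversation in the Cathedral</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1969"/></book>
</library>"""

library = ET.fromstring(XML)

# Emulated union of the author and title paths: iter() visits elements
# in document order, so the results interleave as in the listing above.
wanted = {"author", "title"}
union = [(e.tag, e.text) for e in library.iter() if e.tag in wanted]

for tag, text in union:
    print(tag, text)
```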
Predicate¶
The predicate is written in square brackets, following the axis.
If the axis has selected some nodes, the predicate allows you to restrict that selection to those that meet certain conditions.
- [@attribute]: selects the elements that have the attribute.
- [number]: if there are several results, selects one of them by position, counting from 1; last() selects the last one.
- [condition]: selects the nodes that meet the condition.
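The attribute and position predicates are among the parts of XPath that xml.etree.ElementTree does support, so they can be sketched directly, assuming the sample document in compact form.

```python
import xml.etree.ElementTree as ET

# The sample document, written compactly.
XML = """<library>
<book><title>Life is Elsewhere</title><author>Milan Kundera</author><publicationDate year="1973"/></book>
<book><title>Pantaleon and the Visitors</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1973"/></book>
<book><title>Conversation in the Cathedral</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1969"/></book>
</library>"""

library = ET.fromstring(XML)

# [@attribute]: authors that have a birthDate attribute.
with_birth_date = library.findall("./book/author[@birthDate]")

# [number]: positions start at 1, so book[1] is the first book.
first_title = library.findtext("./book[1]/title")

# last(): the last book among its siblings.
last_title = library.findtext("./book[last()]/title")

print(len(with_birth_date), first_title, last_title)
```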
Conditions¶
Predicates allow you to define conditions on the values of attributes and elements. The following operators can be used in the conditions:
- logical operators: and, or, not()
- arithmetic operators: +, -, *, div, mod
- comparison operators: =, !=, <, >, <=, >=
//publicationDate[@year>1970]
//book[author="Mario Vargas Llosa"]
//@year[.>1970]
//author[.="Mario Vargas Llosa"]
//book[author="Mario Vargas Llosa" and publicationDate/@year="1973"]
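The sketch below approximates some of these conditions with xml.etree.ElementTree, assuming the sample document in compact form. Equality tests on text are supported in its predicates (the [.='text'] form needs Python 3.7 or later), but operators such as > and and are not, so those parts are done in Python code rather than in the path itself.

```python
import xml.etree.ElementTree as ET

# The sample document, written compactly.
XML = """<library>
<book><title>Life is Elsewhere</title><author>Milan Kundera</author><publicationDate year="1973"/></book>
<book><title>Pantaleon and the Visitors</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1973"/></book>
<book><title>Conversation in the Cathedral</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1969"/></book>
</library>"""

library = ET.fromstring(XML)

# Books that have an <author> child with the given text.
llosa_books = library.findall("./book[author='Mario Vargas Llosa']")

# Authors whose own text equals the given value.
llosa_authors = library.findall(".//author[.='Mario Vargas Llosa']")

# Numeric comparison on @year, done in Python instead of in the path.
after_1970 = [
    pd.get("year")
    for pd in library.iter("publicationDate")
    if int(pd.get("year")) > 1970
]

# Emulation of "and": filter the author matches by the year attribute.
llosa_1973 = [
    b.findtext("title")
    for b in llosa_books
    if b.find("publicationDate").get("year") == "1973"
]

print(len(llosa_books), len(llosa_authors), after_1970, llosa_1973)
```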
Node selection¶
The node selection is written after the axis and the predicate. If the axis and the predicate have selected some nodes, the selection of nodes indicates what part of those nodes we keep.
- /node(): selects all children (elements or text) of the node.
- //node(): selects all descendants (elements or text) of the node.
- /text(): selects only the text contained in the node.
- //text(): selects only the text contained in the node and all its descendants.
- /*: selects all children (elements only) of the node.
- //*: selects all descendants (elements only) of the node.
- /@*: selects all attributes of the node.
- //@*: selects all attributes of the node's descendants.
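A rough equivalent of these selections in xml.etree.ElementTree, assuming the sample document in compact form. The * wildcard works in its paths, but text() and @* do not, so the sketch uses the itertext() method and the attrib dictionary as stand-ins.

```python
import xml.etree.ElementTree as ET

# The sample document, written compactly.
XML = """<library>
<book><title>Life is Elsewhere</title><author>Milan Kundera</author><publicationDate year="1973"/></book>
<book><title>Pantaleon and the Visitors</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1973"/></book>
<book><title>Conversation in the Cathedral</title><author birthDate="03/28/1936">Mario Vargas Llosa</author><publicationDate year="1969"/></book>
</library>"""

library = ET.fromstring(XML)
book = library.find("./book[2]")  # the second book

# /*: child elements of the book.
child_tags = [e.tag for e in book.findall("./*")]

# //*: all descendant elements of the library.
descendant_count = len(library.findall(".//*"))

# Stand-in for //text(): text of the book and its descendants,
# skipping whitespace-only strings.
texts = [t for t in book.itertext() if t.strip()]

# Stand-in for /@*: all attributes of the book's author.
author_attributes = dict(book.find("author").attrib)

print(child_tags, descendant_count, texts, author_attributes)
```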
Tools¶
To evaluate XPath expressions we can use xmllint --xpath "expression" file.xml or the VSCode extension XSLT/XPath for Visual Studio Code.
Bibliography, webography and credits¶
- Bartolomé Sintés Marco. (2022, January 30). XPath: XML Path language, available at https://www.mclibre.org/consultar/xml/lecciones/xml-xpath.html