2. XML

Introduction

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.

World Wide Web Consortium

The World Wide Web Consortium (W3C) is the main international standards organization for the World Wide Web. Founded in 1994 and currently led by Tim Berners-Lee, the consortium is made up of member organizations that maintain full-time staff working together in the development of standards for the World Wide Web. As of 21 October 2019, W3C had 443 members. W3C also engages in education and outreach, develops software and serves as an open forum for discussion about the Web.

XML is a markup language similar to HTML, but without predefined tags to use. Instead, you define your own tags (extensible) designed specifically for your needs. This is a powerful way to store data in a format that can be stored, searched, and shared. Most importantly, since the fundamental format of XML is standardized, if you share or transmit XML across systems or platforms, either locally or over the internet, the recipient can still parse the data due to the standardized XML syntax.

These are some languages based on XML:

  • GML (Geography Markup Language).
  • MathML (Mathematical Markup Language).
  • RSS (Really Simple Syndication).
  • SVG (Scalable Vector Graphics).
  • XHTML (eXtensible HyperText Markup Language).

Key terminology

The material in this section is based on the XML Specification. This is not an exhaustive list of all the constructs that appear in XML; it provides an introduction to the key constructs most often encountered in day-to-day use.

Character

An XML document is a string of characters. Almost every legal Unicode character may appear in an XML document.

Processor and application

The processor analyzes the markup and passes structured information to an application. The specification places requirements on what an XML processor must do and not do, but the application is outside its scope. The processor (as the specification calls it) is often referred to colloquially as an XML parser.

Markup and content

The characters making up an XML document are divided into markup and content, which may be distinguished by the application of simple syntactic rules. Generally, strings that constitute markup either begin with the character < and end with a >, or they begin with the character & and end with a ;. Strings of characters that are not markup are content. However, in a CDATA section, the delimiters <![CDATA[ and ]]> are classified as markup, while the text between them is classified as content. In addition, whitespace before and after the outermost element is classified as markup.

Tag

A tag is a markup construct that begins with < and ends with >. Tags come in three flavors:

  • start-tag, such as <section>;
  • end-tag, such as </section>;
  • empty-element tag, such as <line-break />.

Element

An element is a logical document component that either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example is <greeting>Hello, world!</greeting>. Another is <line-break />.

Attribute

An attribute is a markup construct consisting of a name–value pair that exists within a start-tag or empty-element tag. An example is <img src="madonna.jpg" alt="Madonna" />, where the names of the attributes are "src" and "alt", and their values are "madonna.jpg" and "Madonna" respectively. Another example is <step number="3">Connect A to B.</step>, where the name of the attribute is "number" and its value is "3".

An XML attribute can only have a single value, and each attribute can appear at most once on each element. When a list of multiple values is needed, the list must be encoded into a single well-formed attribute value using some format beyond what XML itself defines. Usually this is a comma- or semicolon-delimited list or, if the individual values are known not to contain spaces, a space-delimited list. An example is <div class="inner greeting-box">Welcome!</div>, where the attribute "class" has the single value "inner greeting-box", which in turn indicates the two CSS class names "inner" and "greeting-box".
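As a rough illustration of these terms, the sketch below uses Python's standard xml.etree.ElementTree module as the XML processor. Python is our choice here, not something the tutorial prescribes. It reads the name, attribute and content of the step element, and splits the space-delimited class value by hand, since XML itself only sees one string.

```python
import xml.etree.ElementTree as ET

# The parser plays the role of the XML processor; this script is the application.
step = ET.fromstring('<step number="3">Connect A to B.</step>')

print(step.tag)               # element name: step
print(step.attrib["number"])  # attribute value, always a string: 3
print(step.text)              # element content: Connect A to B.

# XML sees a single attribute value; turning it into a list is up to the application.
div = ET.fromstring('<div class="inner greeting-box">Welcome!</div>')
class_names = div.attrib["class"].split()
print(class_names)            # ['inner', 'greeting-box']
```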

XML declaration

XML documents may begin with an XML declaration that describes some information about themselves. An example is <?xml version="1.0" encoding="UTF-8"?>.

Entities

Like HTML, XML offers methods (called entities) for referring to some special reserved characters (such as a greater than sign which is used for tags). There are five of these characters that you should know:

Character   Description             Entity
<           lt (less than)          &lt;
>           gt (greater than)       &gt;
"           quot (quotation mark)   &quot;
'           apos (apostrophe)       &apos;
&           amp (ampersand)         &amp;

Given the "entities.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<entities>
   <less_than> &lt; </less_than>
   <greater_than> &gt; </greater_than>
   <double_quote> &quot; </double_quote>
   <simple_quote> &apos; </simple_quote>
   <ampersand> &amp; </ampersand>
</entities>

When the file is opened in Google Chrome, it is displayed as follows:

[Figure: the entities.xml file viewed in Google Chrome. Example from the Abrirllave.com XML tutorial.]

In the web browser, wherever an entity reference appears in the XML document (for example &lt;), the corresponding character is displayed (for example <).
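The same substitution can be observed outside a browser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the parser and holding the contents of entities.xml in a bytes literal so that it is self-contained.

```python
import xml.etree.ElementTree as ET

# Same content as entities.xml, embedded here so the sketch needs no file.
ENTITIES_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<entities>
   <less_than> &lt; </less_than>
   <greater_than> &gt; </greater_than>
   <double_quote> &quot; </double_quote>
   <simple_quote> &apos; </simple_quote>
   <ampersand> &amp; </ampersand>
</entities>"""

root = ET.fromstring(ENTITIES_XML)

# The parser has already replaced each entity reference with its character.
resolved = {child.tag: child.text.strip() for child in root}
for name, character in resolved.items():
    print(name, character)
```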

Structure and syntax

XML documents are composed of plain text and tags defined by developers.

Elements are represented by tags. If we want to store a person's name, we can write:

<name>Elsa</name>

This is the basic syntax for writing an XML element:

<tag>text</tag>

Note that the data we want to store (text) is written between the start tag (<tag>) and the end tag (</tag>). In the example, the data is Elsa.

Empty tags

An element in an XML document may have no content. In that case we can write:

<tag></tag>

A shorter way of writing this kind of element is:

<tag />

For example, an empty name element can be written:

<name></name>

Or:

<name />

Relationship between parent and children

A parent element can contain one or more elements:

<people>
   <name>Elsa</name>
   <woman />
   <birthday>
      <day>18</day>
      <month>6</month>
      <year>1996</year>
   </birthday>
   <city>Pamplona</city>
</people>

In this example, the people element contains four elements (children): "name", "woman", "birthday" and "city". In addition, the "birthday" element contains three elements (children): "day", "month" and "year".

Notice that of all the elements in this example, only the "woman" element is empty.
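The parent and child relationships described above can be inspected programmatically. This is a small sketch, again assuming Python's xml.etree.ElementTree; the variable names are ours.

```python
import xml.etree.ElementTree as ET

PEOPLE_XML = """<people>
   <name>Elsa</name>
   <woman />
   <birthday>
      <day>18</day>
      <month>6</month>
      <year>1996</year>
   </birthday>
   <city>Pamplona</city>
</people>"""

people = ET.fromstring(PEOPLE_XML)

# Iterating over an element yields its direct children only.
children = [child.tag for child in people]
grandchildren = [child.tag for child in people.find("birthday")]

# Here an element counts as empty when it has no child elements and no text.
empty = [
    element.tag
    for element in people.iter()
    if len(element) == 0 and not (element.text or "").strip()
]

print(children)       # ['name', 'woman', 'birthday', 'city']
print(grandchildren)  # ['day', 'month', 'year']
print(empty)          # ['woman']
```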

Root element

Each XML document has exactly one root element. It is the only element without a parent, and it encloses all the other elements. The root element is also called the document element. In HTML, the root element is the <html> element. In our example, the people element is the document root.

Graphically, the example can be represented as:

graph TD
   A[people] --> B[name]
   A[people] --> C[woman]
   A[people] --> D[birthday]
   A[people] --> E[city]
   D --> F[day]
   D --> G[month]
   D --> H[year]

In this way, the structure of any XML document can be represented as an inverted tree of elements. It is said that the elements are the ones that give semantic structure to the document.

Elements with mixed content

An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements. In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:

<description>
   character data <br/> more text <br/> 
   and <strong>more data</strong>
</description>
Mixed content is a very effective way of marking up textual data, the most obvious example being HTML.
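How a parser exposes mixed content depends on its data model. As one illustration, Python's xml.etree.ElementTree (our choice of tool) stores the character data before the first child on the parent, and the character data after each child on that child's tail attribute.

```python
import xml.etree.ElementTree as ET

description = ET.fromstring(
    "<description>character data <br/> more text <br/> "
    "and <strong>more data</strong></description>"
)

# Text before the first child element belongs to the parent.
print(repr(description.text))

# Text that follows a child element is stored as that child's tail.
child_tags = [child.tag for child in description]
for child in description:
    print(child.tag, repr(child.text), repr(child.tail))

# itertext() walks all character data in document order.
full_text = "".join(description.itertext())
print(" ".join(full_text.split()))
```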

XML Syntax Rules

You must follow these rules when you create XML syntax:

  • All XML elements must have a closing tag.
  • XML tags are case sensitive.
  • All XML elements must be properly nested.
  • All XML documents must have a root element.
  • Attribute values must always be quoted.

All XML elements must have a closing tag

Omitting the closing tag is an error. Every start-tag must be matched by an end-tag, unless the element is written as an empty-element tag such as <line-break />.

Incorrect:

<body>See Spot run.
<body>See Spot catch the ball.

Correct:

<body>See Spot run.</body>
<body>See Spot catch the ball.</body>

All XML elements must be properly nested

Elements must be closed in the reverse order to that in which they were opened. Overlapping tags are an error.

Incorrect:

<b><i>This text is bold and italic.</b></i>

Correct:

<b><i>This text is bold and italic.</i></b>

All XML documents must have a root element

All XML documents must contain a single tag pair to define a root element. All other elements must be within this root element. All elements can have sub elements (child elements). Sub elements must be correctly nested within their parent element.

Example:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

Attribute values must always be quoted

It is illegal to omit quotation marks around attribute values. XML elements can have attributes in name/value pairs; however, the attribute value must always be quoted.

Incorrect:

<?xml version="1.0" encoding="ISO-8859-1"?>
<note date=05/05/05>
<to>Dick</to>
<from>Jane</from>
</note>

Correct:

<?xml version="1.0" encoding="ISO-8859-1"?>
<note date="05/05/05">
<to>Dick</to>
<from>Jane</from>
</note>

In the incorrect document, the date attribute in the note element is not quoted.

XML tags are case sensitive

XML tags are case sensitive: the tag <Body> is different from the tag <body>, so a start tag and its end tag must be written with the same case.

Incorrect:

<Body>See Spot run.</body>

Correct:

<body>See Spot run.</body>
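The syntax rules above can be checked mechanically by handing each fragment to a parser. The sketch below is one way to do it, assuming Python's xml.etree.ElementTree; the helper name is_well_formed and the shortened sample strings are ours.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the text as a single XML document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


checks = {
    "missing end-tag": "<body>See Spot run.",
    "bad nesting": "<b><i>This text is bold and italic.</b></i>",
    "unquoted attribute": "<note date=05/05/05><to>Dick</to></note>",
    "case mismatch": "<Body>See Spot run.</body>",
    "two root elements": "<body>one</body><body>two</body>",
    "correct": '<note date="05/05/05"><to>Dick</to><from>Jane</from></note>',
}

results = {label: is_well_formed(text) for label, text in checks.items()}
for label, accepted in results.items():
    print(label, "->", "accepted" if accepted else "rejected")
```

Only the last fragment is accepted; each of the others violates one of the rules listed in this section.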

In addition, element names must follow these rules:

  • They can contain lowercase letters, uppercase letters, numbers, periods ".", hyphens "-" and underscores "_".
  • They can also contain the colon ":". However, its use is reserved for defining namespaces.
  • The first character must be a letter or an underscore "_".

The following elements each break at least one rule:

<Ciudad>Pamplona</ciudad>
<día>18</dia>
<mes>6<mes/>
<ciudad>Pamplona</finciudad>
<_rojo>
<2colores>Rojo y Naranja</2colores>
< Aficiones >Cine, Bailar, Nadar</ Aficiones >
<persona><nombre>Elsa</persona></nombre>
<color favorito>azul</color favorito>

They should be written as follows (the last three lines are alternative fixes for the same element):

<ciudad>Pamplona</ciudad>
<día>18</día>
<mes>6</mes>
<ciudad>Pamplona</ciudad>
<_rojo/>
<colores2>Rojo y Naranja</colores2>
<Aficiones >Cine, Bailar, Nadar</Aficiones >
<persona><nombre>Elsa</nombre></persona>
<color.favorito>azul</color.favorito>
<color-favorito>azul</color-favorito>
<color_favorito>azul</color_favorito>

Non-English letters (á, Á, ñ, Ñ ...) are allowed. However, it is advisable not to use them, in order to reduce possible incompatibilities with programs that may not recognize them.

As for the hyphen (-) and period (.) characters, although they are also allowed in tag names, it is likewise advisable to avoid them.

Attributes in XML

Elements of an XML document can have attributes, which are defined in the start tag. An attribute provides extra information about the element that contains it.

Given the following data for a product:

  • Code: G45
  • Name: Wool hat
  • Color: black
  • Price: 12.56

Its representation in an XML document could be, for example:

<product code="G45">
   <name color="black" price="12.56">Wool hat</name>
</product>

In this example three attributes have been written: code, color and price. Note that their values ("G45", "black" and "12.56") are enclosed in double quotes ("). However, they can also be enclosed in single quotes (').

If, for example, the code attribute were to be represented as an element, it could be written:

<product>
   <code>G45</code>
   <name color="black" price="12.56">Wool hat</name>
</product>

As you can see, the value of the code is no longer written in quotes, because it is now element content rather than an attribute value.

Elements and attributes

An element is a logical component of an XML document. Elements usually represent things that have an identity of their own. The content of an element is everything between its start and end tags, including any other elements (children) it contains.

In contrast, attributes usually represent properties or characteristics of elements.

Syntax rules

Attribute names must follow the same syntax rules as element names. In addition, the attribute names within an element must be unique. For example, it is incorrect to write:

<data x="3" x="4" i="5" />

However, it is correct to write:

<data x="3" X="4" i="5" />

The attributes contained in an element (in this case x, X and i) must be separated by whitespace, and their order is not significant.
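Both cases can be tried against a parser. This is a brief sketch, assuming Python's xml.etree.ElementTree: the element with x and X parses into three distinct attributes, while the repeated x is rejected.

```python
import xml.etree.ElementTree as ET

# x and X are different names, so this element is well formed.
data = ET.fromstring('<data x="3" X="4" i="5" />')
print(data.attrib)

# Repeating the same attribute name is rejected by the parser.
try:
    ET.fromstring('<data x="3" x="4" i="5" />')
    duplicate_accepted = True
except ET.ParseError as error:
    duplicate_accepted = False
    print("rejected:", error)
```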

XML declaration

The XML declaration that can be written at the beginning of an XML document begins with the characters <? and ends with ?>.

Version and encoding

An XML document could contain the following XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

This XML declaration indicates that 1.0 is the version of XML used in the document and that UTF-8 (8-bit Unicode Transformation Format) is the character encoding used.

An XML declaration is not required in an XML document. However, if one is included, it must appear on the first line of the document, and the "<" character must be the first character of that line; that is, no whitespace may precede it.

standalone attribute

In an XML declaration, in addition to the version and encoding attributes, the standalone attribute can also be written. It can take two values ("yes" or "no"):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Writing standalone="yes" indicates that the document does not depend on other documents, such as an external DTD (Document Type Definition), which we will see later. Writing standalone="no" indicates that the document is not independent.

In an XML document, writing the XML declaration is optional. If it is written, the version attribute is required. The encoding and standalone attributes are optional, and their default values are "UTF-8" and "no", respectively.

When the encoding attribute is written, it must appear after version. The standalone attribute, when present, must come last.
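The placement and ordering rules for the XML declaration can be exercised with a parser. The following is a minimal sketch, assuming Python's xml.etree.ElementTree (which relies on the Expat parser); the helper name parses and the dictionary keys are ours.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


accepted = {
    "full declaration": parses(
        b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a/>'
    ),
    "version only": parses(b'<?xml version="1.0"?><a/>'),
    "no declaration": parses(b"<a/>"),
    "space before declaration": parses(
        b' <?xml version="1.0" encoding="UTF-8"?><a/>'
    ),
    "version missing": parses(b'<?xml encoding="UTF-8"?><a/>'),
    "encoding before version": parses(
        b'<?xml encoding="UTF-8" version="1.0"?><a/>'
    ),
}

for case, ok in accepted.items():
    print(case, "->", "accepted" if ok else "rejected")
```

The first three documents are accepted and the last three are rejected, which matches the rules stated above.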

Problematic characters in XML: less than (<) and ampersand (&)

In an XML document, the "<" character is problematic because it indicates the beginning of a tag. So, instead of writing, for example:

<condition>a < b</condition>

the entity reference should be used:

<condition>a &lt; b</condition>

The > character can be used in the text contained in an element, so it is not incorrect to write, for example:

<condition>a > b</condition>

However, it is recommended to use its entity reference (&gt;).

In an XML document, the ampersand character is also problematic, because it indicates the beginning of an entity reference. For example, it is incorrect to write:

<condition> a==1 && b==2 </condition>

Instead, write the following:

<condition> a==1 &amp;&amp; b==2 </condition>
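When XML is generated by a program, this escaping is usually delegated to a library rather than done by hand. As a sketch, assuming Python and its standard xml.sax.saxutils.escape function, the text below (a variation of the conditions used above) is escaped and then parsed back to confirm that nothing was lost.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "a==1 && b<2"

# escape() replaces &, < and > with their entity references.
escaped = escape(raw)
print(escaped)  # a==1 &amp;&amp; b&lt;2

# The escaped text can be placed inside an element and parsed back unchanged.
condition = ET.fromstring("<condition>" + escaped + "</condition>")
print(condition.text == raw)  # True
```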

Character references in XML

Unicode character references can be written in an XML document. A reference starts with the characters &#, followed by the decimal value of the Unicode character you want to represent (or by x and its hexadecimal value), and ends with a semicolon ";".

Representation of the Euro character (€) in XML

Given the XML document "products.xml":

<?xml version="1.0" encoding="UTF-8"?>
<products>
   <name price = "12.56&#8364;"> Wool hat </name>
   <name price = "16.99&#x20AC;"> Fleece cap </name>
</products>

When the file is viewed in a web browser, it is displayed as follows:

[Figure: the products.xml file viewed in Google Chrome. Example from the Abrirllave.com XML tutorial.]

Note that, to represent the Euro symbol (€), the first price uses its decimal Unicode value (&#8364;) and the second uses its hexadecimal value (&#x20AC;).
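To confirm that both notations yield the same character, the two name elements can be parsed and their price attributes compared. This is a short sketch assuming Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

products = ET.fromstring(
    "<products>"
    '<name price="12.56&#8364;"> Wool hat </name>'
    '<name price="16.99&#x20AC;"> Fleece cap </name>'
    "</products>"
)

# Both references are resolved to the same character by the parser.
prices = [name.attrib["price"] for name in products]
print(prices)

# Decimal 8364 and hexadecimal 20AC are the same code point, U+20AC.
euro = chr(8364)
print(euro == chr(0x20AC))  # True
```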

Comments in XML

Comments in an XML document must be written between the character strings <!-- and -->. For example:

<!-- This is a comment written in an XML document -->

Given the XML file "letras.xml":

<?xml version = "1.0" encoding = "UTF-8"?>
<!-- Example use of comments. -->
<a>
   <b>
      <c quantity="4">cccc</c>
      <d quantity="2">dd</d>
   </b>
   <e>
      <f quantity="8">ffffffff</f>
      <!-- g may appear several times -->
      <g quantity="5">ggggg </g>
      <g quantity="2">gg </g>
   </e>
</a>

In a browser you will see:

[Figure: the letras.xml file viewed in Google Chrome. Example from the Abrirllave.com XML tutorial.]

In an XML document, comments cannot be written within tags. For example, it is incorrect to write:

<element <!-- empty element --> />

Furthermore, two consecutive hyphens (--) are not allowed inside a comment:

<!-- two hyphens in a row -- in a comment gives error -->

As a consequence, comments cannot be nested in an XML document.
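These comment rules can be tested against a parser as well. The sketch below assumes Python's xml.etree.ElementTree, whose default behaviour is to discard comments while parsing; the helper name parses is ours.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


valid_comment = parses("<a><!-- g may appear several times --><g>gg</g></a>")
comment_in_tag = parses("<element <!-- empty element --> />")
double_hyphen = parses("<a><!-- two hyphens -- in a row --></a>")

# By default ElementTree drops comments, so <g> is the only child left.
tree = ET.fromstring("<a><!-- note --><g>gg</g></a>")
child_tags = [child.tag for child in tree]

print(valid_comment, comment_in_tag, double_hyphen)  # True False False
print(child_tags)                                    # ['g']
```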

CDATA sections in XML

An XML document can contain CDATA (Character DATA) sections for writing text that the parser should not interpret as markup. For example, this is useful when you want to write text that contains either of the problematic characters: less than (<) or ampersand (&).

To include a CDATA section in an XML document, we start with the character string

<![CDATA[

and end with the characters

]]>

A CDATA section may contain, for example, the source code of a program written in the C language:

<?xml version="1.0" encoding="UTF-8"?>
<CDATA_example>
<![CDATA[
#include <stdio.h>
int main ()
{
   float note;
   printf ("\n Enter note (real):");
   scanf ("%f", &note);
   if (5 <= note)
      printf ("\n APPROVED");
   return 0;
}
]]>
</CDATA_example>

A web browser will display something like:

[Figure: the example_cdata.xml file viewed in Google Chrome. Example from the Abrirllave.com XML tutorial.]

The string "]]>" cannot be written within a CDATA section. Consequently, CDATA sections cannot be nested.

In addition, whitespace and line breaks are not allowed inside the start string "<![CDATA[" or the end string "]]>" of a CDATA section.
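A parser hands the contents of a CDATA section to the application as ordinary character data. As a sketch, assuming Python's xml.etree.ElementTree and a shortened version of the C fragment, the text comes back with < and & intact, and a start string containing a space is rejected.

```python
import xml.etree.ElementTree as ET

SOURCE = """<CDATA_example><![CDATA[
if (5 <= note && note <= 10)
   printf("APPROVED");
]]></CDATA_example>"""

example = ET.fromstring(SOURCE)

# The characters < and & arrive literally; no entity references were needed.
print(example.text)

# A space inside the start string makes the document ill formed.
try:
    ET.fromstring("<a><![CDATA [text]]></a>")
    space_accepted = True
except ET.ParseError:
    space_accepted = False
print(space_accepted)  # False
```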

Editing tools

To edit XML documents, a plain text editor such as Notepad or Pluma is enough, but we can also use a dedicated editor such as XML Copy Editor, or an IDE with a plugin, such as Visual Studio Code with XML Tools or XML by Red Hat.

Preparation of well-formed XML documents

An XML document is said to be well-formed when it has no syntax errors. This includes the following aspects:

  1. The names of elements and their attributes must be written correctly.
  2. Attribute values must be enclosed in double or single quotes.
  3. The attributes of an element must be separated by whitespace.
  4. Entity references must be used where necessary.
  5. There must be a single root element.
  6. Every element must have a parent element except the root element.
  7. Every element must have a start tag and an end tag, or be written as an empty-element tag.
  8. Tags must be nested correctly.
  9. The XML declaration, if present, must be on the first line and written correctly.
  10. CDATA sections and comments must be written correctly.

Using namespaces in XML

XML namespaces are a mechanism for ensuring that the elements and attributes of an XML document have unique names. They are defined in a W3C recommendation. The problem they solve is the ambiguity that arises when an XML document contains element or attribute names from several vocabularies, so that different elements or attributes end up with the same name: if each vocabulary is given a different namespace, the ambiguity is resolved.

For example:

 <menu>
    <option>Save</option>
    <description>Save the current document</description>
 </menu>
 <menu>
    <meats>
       <veal_steak price="12.95" />
       <sirloin_staeck price="13.60" />
    </meats>
    <fishes>
       <baked_salmon  price="16.20" />
       <hake_in_green_sauce price="15.85" />
    </fishes>
 </menu>

If both <menu> elements are included in the same XML document, a name conflict arises. To solve it, namespaces can be used. For example:

<?xml version = "1.0" encoding = "UTF-8"?>
<e1:example xmlns:e1="http://www.abrirllave.com/example1"
   xmlns:e2="http://www.abrirllave.com/example2">

 <e1:menu>
    <e1:option>Save</e1:option>
    <e1:description>Save the current document</e1:description>
 </e1:menu>

 <e2:menu>
    <e2:meats>
       <e2:veal_steak price = "12.95" />
       <e2:sirloin_staeck price = "13.60" />
    </e2:meats>
    <e2:fishes>
       <e2:baked_salmon  price = "16.20" />
       <e2:hake_in_green_sauce price = "15.85" />
    </e2:fishes>
 </e2:menu>
</e1:example>

The following syntax is used to define a namespace:

xmlns:prefix="URI"

In the example, notice that xmlns is an attribute used in the start tag of the <e1:example> element and that, in this case, two namespaces have been defined, which refer to the following URIs (Uniform Resource Identifiers):

  • http://www.abrirllave.com/example1
  • http://www.abrirllave.com/example2

The defined prefixes are e1 and e2, respectively. These prefixes have been added to the tags that appear in the document: <e1:menu>, <e2:menu>, <e1:option>, etc.
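What matters to a namespace-aware parser is the URI, not the prefix. The sketch below assumes Python's xml.etree.ElementTree and a trimmed version of the document; the prefixes "app" and "food" used for searching are our own and differ from e1 and e2 on purpose.

```python
import xml.etree.ElementTree as ET

DOC = """<e1:example xmlns:e1="http://www.abrirllave.com/example1"
   xmlns:e2="http://www.abrirllave.com/example2">
 <e1:menu>
    <e1:option>Save</e1:option>
 </e1:menu>
 <e2:menu>
    <e2:meats>
       <e2:veal_steak price="12.95" />
    </e2:meats>
 </e2:menu>
</e1:example>"""

root = ET.fromstring(DOC)

# ElementTree replaces each prefix with its URI, written as {URI}local-name.
menu_tags = [child.tag for child in root]
print(menu_tags)

# Searches may map any prefixes to the same URIs.
prefixes = {
    "app": "http://www.abrirllave.com/example1",
    "food": "http://www.abrirllave.com/example2",
}
steak = root.find("food:menu/food:meats/food:veal_steak", prefixes)
print(steak.attrib["price"])  # 12.95
```

The two menu elements no longer clash, because their full names differ in the URI part.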

Defining a default namespace

Alternatively, a default namespace can be defined using the following syntax:

xmlns="URI"

In this way, both the element where the namespace is defined and all its descendants (children, children of children, etc.) belong to this namespace, unless one of them defines a different one. For example:

<?xml version = "1.0" encoding = "UTF-8"?>
<example xmlns="http://www.abrirllave.com/example1">
  <menu>
    <option>Save</option>
    <description>Save the current document</description>
 </menu>
</example>

In the following example, a default namespace is initially defined for the <example> element and its contents. However, a second default namespace is then defined, which applies to the second <menu> element that appears in the document and its descendants (<meats>, <fishes> and so on):

<?xml version = "1.0" encoding = "UTF-8"?>
<example xmlns="http://www.abrirllave.com/example1">  

 <menu>
    <option>Save</option>
    <description>Save the current document</description>
 </menu>

 <menu xmlns="http://www.abrirllave.com/example2">
    <meats>
       <veal_steak price = "12.95" />
       <sirloin_staeck price = "13.60" />
    </meats>
    <fishes>
       <baked_salmon  price = "16.20" />
       <hake_in_green_sauce price = "15.85" />
    </fishes>
 </menu>
</example>

In an XML document, to indicate that certain elements (or all of them) do not belong to any namespace, the xmlns attribute is written with an empty value, that is, xmlns="".

<?xml version = "1.0" encoding = "UTF-8"?>
<example xmlns="http://www.abrirllave.com/example1">  

 <menu>
    <option>Save</option>
    <description>Save the current document</description>
 </menu>

 <menu xmlns="http://www.abrirllave.com/example2">
    <meats>
       <veal_steak price = "12.95" />
       <sirloin_staeck price = "13.60" />
    </meats>
    <fishes xmlns="">
       <baked_salmon  price = "16.20" />
       <hake_in_green_sauce price = "15.85" />
    </fishes>
 </menu>
</example>

In this case, the <fishes> element and its children do not belong to any namespace.
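The effect of each default namespace declaration can be made visible by listing the full name of every element. This is a sketch assuming Python's xml.etree.ElementTree and a trimmed version of the document above.

```python
import xml.etree.ElementTree as ET

DOC = """<example xmlns="http://www.abrirllave.com/example1">
 <menu>
    <option>Save</option>
 </menu>
 <menu xmlns="http://www.abrirllave.com/example2">
    <meats>
       <veal_steak price="12.95" />
    </meats>
    <fishes xmlns="">
       <baked_salmon price="16.20" />
    </fishes>
 </menu>
</example>"""

root = ET.fromstring(DOC)

# iter() visits the root and all its descendants in document order.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

Names printed with a {URI} part belong to that namespace; fishes and baked_salmon are printed without one because xmlns="" removed the default namespace for them.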

Bibliography, webography and credits