2. XML¶
Introduction¶
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.
The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.
World Wide Web Consortium
The World Wide Web Consortium (W3C) is the main international standards organization for the World Wide Web. Founded in 1994 and currently led by Tim Berners-Lee, the consortium is made up of member organizations that maintain full-time staff working together in the development of standards for the World Wide Web. As of 21 October 2019, W3C had 443 members. W3C also engages in education and outreach, develops software and serves as an open forum for discussion about the Web.
XML is a markup language similar to HTML, but without predefined tags to use. Instead, you define your own tags (extensible) designed specifically for your needs. This is a powerful way to store data in a format that can be stored, searched, and shared. Most importantly, since the fundamental format of XML is standardized, if you share or transmit XML across systems or platforms, either locally or over the internet, the recipient can still parse the data due to the standardized XML syntax.
These are some languages based on XML:
- GML (Geography Markup Language).
- MathML (Mathematical Markup Language).
- RSS (Really Simple Syndication).
- SVG (Scalable Vector Graphics).
- XHTML (eXtensible HyperText Markup Language).
Key terminology¶
The material in this section is based on the XML Specification. This is not an exhaustive list of all the constructs that appear in XML; it provides an introduction to the key constructs most often encountered in day-to-day use.
Character¶
An XML document is a string of characters. Almost every legal Unicode character may appear in an XML document.
Processor and application¶
The processor analyzes the markup and passes structured information to an application. The specification places requirements on what an XML processor must do and not do, but the application is outside its scope. The processor (as the specification calls it) is often referred to colloquially as an XML parser.
Markup and content¶
The characters making up an XML document are divided into markup and content, which may be distinguished by the application of simple syntactic rules. Generally, strings that constitute markup either begin with the character < and end with a >, or they begin with the character & and end with a ;. Strings of characters that are not markup are content. However, in a CDATA section, the delimiters <![CDATA[ and ]]> are classified as markup, while the text between them is classified as content. In addition, whitespace before and after the outermost element is classified as markup.
Tag¶
A tag is a markup construct that begins with < and ends with >. Tags come in three flavors:
- start-tag, such as <section>;
- end-tag, such as </section>;
- empty-element tag, such as <line-break />.
Element¶
An element is a logical document component that either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element's content, and may contain markup, including other elements, which are called child elements. An example is <greeting>Hello, world!</greeting>. Another is <line-break />.
Attribute¶
An attribute is a markup construct consisting of a name-value pair that exists within a start-tag or empty-element tag. An example is <img src="madonna.jpg" alt="Madonna" />, where the names of the attributes are "src" and "alt", and their values are "madonna.jpg" and "Madonna" respectively. Another example is <step number="3">Connect A to B.</step>, where the name of the attribute is "number" and its value is "3".
An XML attribute can only have a single value, and each attribute can appear at most once on each element. In the common situation where a list of multiple values is desired, the list must be encoded into a well-formed XML attribute using some format beyond what XML itself defines. Usually this is either a comma-delimited or semicolon-delimited list or, if the individual values are known not to contain spaces, a space-delimited list. For example, in <div class="inner greeting-box">Welcome!</div>, the attribute "class" has the single value "inner greeting-box", which also indicates the two CSS class names "inner" and "greeting-box".
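The space-delimited convention can be sketched in a few lines. This is only an illustration, assuming Python's standard `xml.etree.ElementTree` module (not something the original text uses); the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Parse the element used as an example in the text above.
div = ET.fromstring('<div class="inner greeting-box">Welcome!</div>')

# To XML, the attribute has exactly one value: a single string.
raw_value = div.get("class")

# Splitting on whitespace is the application's convention, not XML's.
class_names = raw_value.split()

print(raw_value)
print(class_names)
```

The parser only ever sees one string; interpreting it as a list is left to the application.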
XML declaration¶
XML documents may begin with an XML declaration that describes some information about themselves. An example is <?xml version="1.0" encoding="UTF-8"?>.
Entities¶
Like HTML, XML offers methods (called entities) for referring to some special reserved characters (such as the less-than sign, which is used to open tags). There are five of these characters that you should know:
Character | Description | Entity |
---|---|---|
< | lt (less than) | &lt; |
> | gt (greater than) | &gt; |
" | quot (quotation mark) | &quot; |
' | apos (apostrophe) | &apos; |
& | amp (ampersand) | &amp; |
Given the "entities.xml" file:
<?xml version="1.0" encoding="UTF-8"?>
<entities>
<less_than> &lt; </less_than>
<greater_than> &gt; </greater_than>
<double_quote> &quot; </double_quote>
<simple_quote> &apos; </simple_quote>
<ampersand> &amp; </ampersand>
</entities>
When you open it in Google Chrome, you can see that wherever an entity reference has been written in the XML document (for example &lt;), the corresponding character is displayed (for example <).
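A browser is not the only program that resolves entity references; any XML parser does. The following sketch assumes Python's standard `xml.etree.ElementTree` module (an assumption, the original text only mentions the browser) and uses a copy of the entities document without the surrounding spaces.

```python
import xml.etree.ElementTree as ET

document = """<entities>
  <less_than>&lt;</less_than>
  <greater_than>&gt;</greater_than>
  <double_quote>&quot;</double_quote>
  <simple_quote>&apos;</simple_quote>
  <ampersand>&amp;</ampersand>
</entities>"""

root = ET.fromstring(document)

# The parser hands the application the characters, not the references.
decoded = {child.tag: child.text for child in root}
for name, character in decoded.items():
    print(name, "->", character)
```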
Structure and syntax¶
XML documents are composed of plain text and tags defined by developers.
Elements are represented by tags. If we want to store a person's name, we could write:
<name>Elsa</name>
The general syntax is:
<tag>text</tag>
Between the start tag (<tag>) and the end tag (</tag>) we write the data (text) we want to store: Elsa in the example.
Empty tags¶
In an XML document, an element may have no content. In that case, it can be written in either of two ways:
<tag></tag>
<tag />
For example, an empty name element can be written as:
<name></name>
<name />
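Both spellings describe the same empty element. A quick check, assuming Python's standard `xml.etree.ElementTree` module (not part of the original text), could look like this:

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<name></name>")
short_form = ET.fromstring("<name />")

for element in (long_form, short_form):
    # An empty element has no text (None) and no child elements (length 0).
    print(element.tag, element.text, len(element))
```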
Relationship between parents and children¶
A parent element can contain one or more elements:
<people>
<name>Elsa</name>
<woman />
<birthday>
<day>18</day>
<month>6</month>
<year>1996</year>
</birthday>
<city>Pamplona</city>
</people>
In this example, the people element contains four elements (children): "name", "woman", "birthday" and "city". In addition, the "birthday" element contains three elements (children): "day", "month" and "year".
Notice that, of all the elements in this example, only the "woman" element is empty.
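The parent and child relationships described above can be inspected programmatically. The sketch below assumes Python's standard `xml.etree.ElementTree` module, which the original text does not mention; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

document = """<people>
  <name>Elsa</name>
  <woman />
  <birthday>
    <day>18</day>
    <month>6</month>
    <year>1996</year>
  </birthday>
  <city>Pamplona</city>
</people>"""

people = ET.fromstring(document)

# Iterating over an element yields its direct children only.
children = [child.tag for child in people]
grandchildren = [child.tag for child in people.find("birthday")]

# Here an element counts as empty when it has no children and no text.
empty = [e.tag for e in people.iter() if len(e) == 0 and e.text is None]

print(children)
print(grandchildren)
print(empty)
```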
Root element¶
Each XML document has exactly one root element. It encloses all the other elements and is therefore the only element without a parent. Root elements are also called document elements. In HTML, the root element is the <html> element. In our example, the people element is the document root.
Graphically, it can be represented as a tree with people at the top, its four children below it, and the three children of birthday on the lowest level.
In this way, the structure of any XML document can be represented as an inverted tree of elements. It is said that the elements are the ones that give semantic structure to the document.
Elements with mixed content¶
An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements. In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:
<description>
character data <br/> more text <br/>
and <strong>more data</strong>
</description>
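How a parser exposes mixed content depends on the library. As one possible illustration, Python's standard `xml.etree.ElementTree` module (an assumption, not something the original text relies on) stores the character data before the first child in `text` and the character data after each child in that child's `tail`:

```python
import xml.etree.ElementTree as ET

description = ET.fromstring(
    "<description>character data <br/> more text <br/>"
    " and <strong>more data</strong></description>"
)

# Character data that comes before the first child element.
print(repr(description.text))

# Each child carries its own text plus the character data that follows it.
for child in description:
    print(child.tag, repr(child.text), repr(child.tail))

# itertext() walks all character data in document order.
pieces = list(description.itertext())
print(pieces)
```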
XML Syntax Rules¶
You must follow these rules when you create XML syntax:
- All XML elements must have a closing tag.
- XML tags are case sensitive.
- All XML elements must be properly nested.
- All XML documents must have a root element.
- Attribute values must always be quoted.
All XML elements must have a closing tag¶
It is illegal to omit the closing tag when you are creating XML syntax. XML elements must have a closing tag.
Incorrect:
<body>See Spot run.
<body>See Spot catch the ball.
Correct:
<body>See Spot run.</body>
<body>See Spot catch the ball.</body>
All XML elements must be properly nested¶
Improper nesting of tags makes no sense to XML.
Incorrect:
<b><i>This text is bold and italic.</b></i>
Correct:
<b><i>This text is bold and italic.</i></b>
All XML documents must have a root element¶
All XML documents must contain a single tag pair to define a root element. All other elements must be within this root element. All elements can have sub elements (child elements). Sub elements must be correctly nested within their parent element.
Example:
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
Attribute values must always be quoted¶
It is illegal to omit quotation marks around attribute values. XML elements can have attributes in name/value pairs; however, the attribute value must always be quoted.
Incorrect:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note date=05/05/05>
<to>Dick</to>
<from>Jane</from>
</note>
Correct:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note date="05/05/05">
<to>Dick</to>
<from>Jane</from>
</note>
In the incorrect document, the value of the date attribute in the note element is not quoted.
XML tags are case sensitive¶
When you create XML documents, the tag <Body> is different from the tag <body>.
Incorrect:
<Body>See Spot run.</body>
Correct:
<body>See Spot run.</body>
In addition, element names have to fulfil the following rules:
- They can contain lowercase letters, uppercase letters, numbers, periods ".", hyphens "-" and underscores "_".
- They can also contain the colon ":". However, its use is reserved for defining namespaces.
- The first character must be a letter or an underscore "_".
The following elements break some of these rules:
<Ciudad>Pamplona</ciudad>
<día>18</dia>
<mes>6<mes/>
<ciudad>Pamplona</finciudad>
<_rojo>
<2colores>Rojo y Naranja</2colores>
< Aficiones >Cine, Bailar, Nadar</ Aficiones >
<persona><nombre>Elsa</persona></nombre>
<color favorito>azul</color favorito>
They should be written as follows:
<ciudad>Pamplona</ciudad>
<día>18</día>
<mes>6</mes>
<ciudad>Pamplona</ciudad>
<_rojo/>
<colores2>Rojo y Naranja</colores2>
<Aficiones >Cine, Bailar, Nadar</Aficiones >
<persona><nombre>Elsa</nombre></persona>
<color.favorito>azul</color.favorito>
<color-favorito>azul</color-favorito>
<color_favorito>azul</color_favorito>
As for the hyphen - and period . characters, although they are also allowed in tag names, it is advisable to avoid them.
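The naming rules listed in this section can be approximated with a regular expression. This is a simplified sketch in Python, not the full Name production of the XML specification (which allows a wider and more precisely defined set of characters); the function name `is_valid_name` is hypothetical.

```python
import re

# Simplified rule: first a letter or "_", then letters, digits, ".", "-" or "_".
# The colon is left out on purpose because it is reserved for namespaces.
NAME_PATTERN = re.compile(r"[^\W\d][\w.\-]*")


def is_valid_name(name: str) -> bool:
    """Return True if name follows the simplified naming rules above."""
    return NAME_PATTERN.fullmatch(name) is not None


candidates = ["ciudad", "día", "_rojo", "2colores", "color favorito", "color-favorito"]
for candidate in candidates:
    print(candidate, is_valid_name(candidate))
```

Names such as 2colores (starts with a digit) and color favorito (contains a space) are rejected, while the other candidates pass.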
Attributes in XML¶
Elements of an XML document can have attributes defined in the start tag. An attribute serves to provide extra information about the item that contains it.
Given the following data for a product:
- Code: G45
- Name: Wool hat
- Color: black
- Price: 12.56
Its representation in an XML document could be, for example:
<product code="G45">
<name color="black" price="12.56">Wool hat</name>
</product>
If, for example, the code attribute were to be represented as an element, it could be written:
<product>
<code>G45</code>
<name color="black" price="12.56">Wool hat</name>
</product>
Elements and attributes¶
An element is a logical component of an XML document. Elements usually have an identity of their own. The content of an element is everything between its start and end tags, including any other elements (children) it contains.
In contrast, attributes usually represent properties or characteristics of elements.
Syntax rules¶
Attribute names must meet the same syntax rules as element names. In addition, the attribute names within an element must be unique. For example, it is incorrect to write:
<data x="3" x="4" i="5" />
However, because names are case sensitive, it is correct to write:
<data x="3" X="4" i="5" />
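The uniqueness rule is enforced by XML parsers. A small check, assuming Python's standard `xml.etree.ElementTree` module (not used in the original text) and a hypothetical helper named `parses`:

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


duplicated = '<data x="3" x="4" i="5" />'
case_differs = '<data x="3" X="4" i="5" />'

print(parses(duplicated))    # the repeated attribute name is rejected
print(parses(case_differs))  # x and X are different names
print(ET.fromstring(case_differs).attrib)
```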
XML declaration¶
The XML declaration that can be written at the beginning of an XML document begins with the characters <? and ends with ?>.
Version and encoding¶
An XML document could contain the following XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
An XML declaration is not required to appear in an XML document. However, if the document includes one, it must appear on the first line of the document, and the "<" character must be the first character of that line; that is, no whitespace may appear before it.
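The placement rule can be verified with a parser. The sketch below assumes Python's standard `xml.etree.ElementTree` module (an assumption, not part of the original text); `parses` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


at_start = b'<?xml version="1.0" encoding="UTF-8"?>\n<name>Elsa</name>'
after_space = b' <?xml version="1.0" encoding="UTF-8"?>\n<name>Elsa</name>'
no_declaration = b"<name>Elsa</name>"

print(parses(at_start))        # declaration at the very start: accepted
print(parses(after_space))     # a single leading space: rejected
print(parses(no_declaration))  # the declaration is optional: accepted
```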
standalone attribute¶
In an XML declaration, in addition to the version and encoding attributes, the standalone attribute can also be written, which can take two values ("yes" or "no"):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
standalone="yes" indicates that the document is independent of others, such as an external DTD (Document Type Definition, which we will see later). Otherwise, standalone="no" would mean that the document is not independent.
In an XML document, writing the XML declaration is optional. But, if it is written, the version attribute is required. The encoding and standalone attributes, however, are optional, and by default their values are "UTF-8" and "no", respectively.
On the other hand, when the encoding attribute is written, it must always appear after version. And the standalone attribute, when present, must come last.
Problematic characters in XML: less than (<) and ampersand (&)¶
In an XML document, the "<" character is problematic because it indicates the beginning of a tag. So, instead of writing, for example:
<condition>a < b</condition>
you should use the entity reference and write:
<condition>a &lt; b</condition>
The > character, on the other hand, can be used in the text contained in an element, and it is not incorrect to write, for example:
<condition>a > b</condition>
Its entity reference (&gt;) can also be used.
In an XML document, the ampersand character is also problematic, as it indicates the beginning of an entity reference. For example, it is incorrect to write:
<condition> a==1 && b==2 </condition>
Instead, write:
<condition> a==1 &amp;&amp; b==2 </condition>
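When XML is generated by a program, these replacements are usually delegated to a library function. As a sketch, Python's standard library offers `xml.sax.saxutils.escape`, which replaces &, < and > with their entity references; its use here is an illustration, not something prescribed by the original text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

for text in ["a < b", "a==1 && b==2"]:
    # escape() turns the problematic characters into entity references.
    escaped = escape(text)
    # Parsing the escaped text gives back the original characters.
    element = ET.fromstring(f"<condition>{escaped}</condition>")
    print(escaped, "->", element.text)
```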
Character references in XML¶
Unicode character references can be written in an XML document with the characters &#, followed by the decimal value of the Unicode character you want to represent (or by x and its hexadecimal value), and finally the semicolon character ";".
Representation of the Euro character (€) in XML
Given the XML document "products.xml":
<?xml version="1.0" encoding="UTF-8"?>
<products>
<name price="12.56&#8364;"> Wool hat </name>
<name price="16.99&#x20AC;"> Fleece cap </name>
</products>
When the document is viewed in a web browser, both prices are displayed with the € symbol. Note that, in this case, the Euro symbol (€) is represented the first time by its decimal value in Unicode (&#8364;) and the second time by its hexadecimal value (&#x20AC;).
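The equivalence of the decimal and hexadecimal forms can be checked with a parser. This sketch assumes Python's standard `xml.etree.ElementTree` module, which is not part of the original text.

```python
import xml.etree.ElementTree as ET

document = """<products>
  <name price="12.56&#8364;">Wool hat</name>
  <name price="16.99&#x20AC;">Fleece cap</name>
</products>"""

products = ET.fromstring(document)

# Both references resolve to the same character, U+20AC.
prices = [name.get("price") for name in products]
print(prices)

# The code point of the Euro sign in decimal and in hexadecimal.
print(ord("\u20ac"), hex(ord("\u20ac")))
```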
Comments in XML¶
To write comments in an XML document, they must be written between the characters <!-- and -->. For example:
<!-- This is a comment written in an XML document -->
Given the XML file "letras.xml":
<?xml version = "1.0" encoding = "UTF-8"?>
<!-- Example use of comments .-->
<a>
<b>
<c quantity="4">cccc</c>
<d quantity="2">dd</d>
</b>
<e>
<f quantity="8">ffffffff</f>
<!-- g may appear several times -->
<g quantity="5">ggggg </g>
<g quantity="2">gg </g>
</e>
</a>
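In an XML document, comments cannot be written within tags. For example, it is incorrect to write:
<element <!-- empty element --> />
In addition, two consecutive hyphens (--) cannot be written inside a comment. For example, the following comment causes an error:
<!-- two hyphens in a row -- in a comment gives error -->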
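Both restrictions can be observed with a parser. The sketch below assumes Python's standard `xml.etree.ElementTree` module (not mentioned in the original text), which by default skips comments when building the tree; `parses` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


valid = "<e><!-- g may appear several times --><g quantity='5'>ggggg</g></e>"
inside_tag = "<element <!-- empty element --> />"
double_hyphen = "<e><!-- two hyphens -- in a row --></e>"

# By default the comment does not become part of the element tree.
tags = [child.tag for child in ET.fromstring(valid)]
print(tags)

print(parses(inside_tag))     # comment inside a tag: rejected
print(parses(double_hyphen))  # "--" inside a comment: rejected
```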
CDATA sections in XML¶
An XML document can contain CDATA (Character DATA) sections for writing text that is not intended to be parsed.
For example, this can be useful when you want to write text that contains any of the problematic characters: less than < or ampersand &.
In an XML document, a CDATA section starts with the character string <![CDATA[ and ends with the character string ]]>.
A CDATA section may contain, for example, the source code of a program written in the C language:
<?xml version="1.0" encoding="UTF-8"?>
<CDATA_example>
<![CDATA[
#include <stdio.h>

int main()
{
    float note;
    printf("\n Enter note (real):");
    scanf("%f", &note);
    if (5 <= note)
        printf("\n APPROVED");
    return 0;
}
]]>
</CDATA_example>
The string "]]>" cannot be written within a CDATA section. Consequently, CDATA sections cannot be nested.
On the other hand, whitespace and line breaks are not allowed inside the start string "<![CDATA[" or the end string "]]>" of a CDATA section.
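A parser returns the content of a CDATA section as ordinary text, with < and & left untouched. The following sketch assumes Python's standard `xml.etree.ElementTree` module (an assumption, not part of the original text) and uses a shortened, hypothetical fragment of C code.

```python
import xml.etree.ElementTree as ET

document = """<CDATA_example><![CDATA[if (5 <= note && note <= 10) printf("APPROVED");]]></CDATA_example>"""

# The CDATA delimiters are markup; only the text between them is content.
code = ET.fromstring(document).text
print(code)

# A space inside the start string makes the document ill-formed.
try:
    ET.fromstring("<CDATA_example><![CDATA [x]]></CDATA_example>")
    space_accepted = True
except ET.ParseError:
    space_accepted = False
print(space_accepted)
```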
Editing tools¶
To edit XML documents it is enough to have a plain text editor, such as Notepad or Pluma, but we can also use some specific editors like XML Copy Editor or an IDE with some plugin like Visual Studio Code with XML Tools and XML by Red Hat.
Preparation of well-formed XML documents¶
An XML document is said to be well-formed when it has no syntax errors. This includes the following aspects:
- The names of elements and of their attributes must be written correctly.
- Attribute values must be enclosed in double or single quotes.
- The attributes of an element must be separated by whitespace.
- References to entities should be used where necessary.
- There must be a single root element.
- Every element must have a parent element except the root element.
- All elements must have an opening tag and a closing tag.
- Tags must be nested correctly.
- The XML declaration, if present, must be on the first line and written correctly.
- CDATA sections and comments must be written correctly.
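Any XML parser can serve as a well-formedness checker, because it must reject documents that break these rules. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a hypothetical helper named `check_well_formed`:

```python
import xml.etree.ElementTree as ET


def check_well_formed(document: str):
    """Return None if the document parses, else the parser's error message."""
    try:
        ET.fromstring(document)
        return None
    except ET.ParseError as error:
        return str(error)


samples = [
    "<note date='05/05/05'><to>Dick</to></note>",  # well-formed
    "<note date=05/05/05><to>Dick</to></note>",    # unquoted attribute value
    "<b><i>bold and italic</b></i>",               # tags nested incorrectly
    "<body>See Spot run.",                         # missing end tag
    "<a>one</a><b>two</b>",                        # more than one root element
]

results = [check_well_formed(sample) for sample in samples]
for sample, message in zip(samples, results):
    print(sample, "->", message or "well-formed")
```

The exact wording of the error messages depends on the underlying parser, so the sketch only distinguishes between "parses" and "does not parse".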
Using namespaces in XML¶
XML namespaces are a mechanism for ensuring that the elements and attributes of an XML document have unique names. They are defined in a W3C recommendation. The problem they solve is the ambiguity that arises when an XML document contains element or attribute names from several vocabularies, so that different elements or attributes end up with the same name: if each vocabulary is given a different namespace, the ambiguity is resolved.
For example, consider the following two XML documents, both of which use a <menu> element:
<menu>
<option>Save</option>
<description>Save the current document</description>
</menu>
<menu>
<meats>
<veal_steak price="12.95" />
<sirloin_staeck price="13.60" />
</meats>
<fishes>
<baked_salmon price="16.20" />
<hake_in_green_sauce price="15.85" />
</fishes>
</menu>
So, if both <menu> elements are included in the same XML document, a name conflict arises. To solve it, namespaces can be used. For example, by writing:
<?xml version = "1.0" encoding = "UTF-8"?>
<e1:example xmlns:e1="http://www.abrirllave.com/example1"
xmlns:e2="http://www.abrirllave.com/example2">
<e1:menu>
<e1:option>Save</e1:option>
<e1:description>Save the current document</e1:description>
</e1:menu>
<e2:menu>
<e2:meats>
<e2:veal_steak price = "12.95" />
<e2:sirloin_staeck price = "13.60" />
</e2:meats>
<e2:fishes>
<e2:baked_salmon price = "16.20" />
<e2:hake_in_green_sauce price = "15.85" />
</e2:fishes>
</e2:menu>
</e1:example>
The following syntax is used to define a namespace:
xmlns:prefix="URI"
In the example, notice that xmlns attributes are used in the start tag of the <e1:example> element and that, in this case, two namespaces have been defined, which refer to the following URIs (Uniform Resource Identifiers):
- http://www.abrirllave.com/example1
- http://www.abrirllave.com/example2
The defined prefixes are e1 and e2, respectively. Prefixes have been added to the tags that appear in the document: <e1:menu>, <e2:menu>, <e1:option>, etc.
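For a namespace-aware parser, what identifies an element is the pair of namespace URI and local name; the prefix is only shorthand. As an illustration, Python's standard `xml.etree.ElementTree` module (an assumption, not part of the original text) writes such names as {URI}name. The prefixes app and food in the search mapping are hypothetical and deliberately different from e1 and e2.

```python
import xml.etree.ElementTree as ET

document = """<e1:example xmlns:e1="http://www.abrirllave.com/example1"
            xmlns:e2="http://www.abrirllave.com/example2">
  <e1:menu><e1:option>Save</e1:option></e1:menu>
  <e2:menu><e2:meats><e2:veal_steak price="12.95" /></e2:meats></e2:menu>
</e1:example>"""

root = ET.fromstring(document)
print(root.tag)

# The prefixes used for searching are local to this mapping;
# only the URIs have to match the ones in the document.
ns = {
    "app": "http://www.abrirllave.com/example1",
    "food": "http://www.abrirllave.com/example2",
}

app_menu = root.find("app:menu", ns)
food_menu = root.find("food:menu", ns)
option = app_menu.find("app:option", ns).text
price = food_menu.find("food:meats/food:veal_steak", ns).get("price")

print(app_menu.tag)
print(food_menu.tag)
print(option, price)
```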
Defining a default namespace¶
Alternatively, a default namespace can be defined using the following syntax:
xmlns="URI"
In this way, both the element where the namespace has been defined and all its descendants (children, children of children, etc.) belong to this namespace. For example:
<?xml version = "1.0" encoding = "UTF-8"?>
<example xmlns="http://www.abrirllave.com/example1">
<menu>
<option>Save</option>
<description>Save the current document</description>
</menu>
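</example>
In the following example, a default namespace is initially defined for the <example> element and the elements contained in it. However, a second namespace is then defined, which by default affects the second <menu> element that appears in the document and its successors: <meats>, <fishes>, <veal_steak>, etc.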
<?xml version = "1.0" encoding = "UTF-8"?>
<example xmlns="http://www.abrirllave.com/example1">
<menu>
<option>Save</option>
<description>Save the current document</description>
</menu>
<menu xmlns="http://www.abrirllave.com/example2">
<meats>
<veal_steak price = "12.95" />
<sirloin_staeck price = "13.60" />
</meats>
<fishes>
<baked_salmon price = "16.20" />
<hake_in_green_sauce price = "15.85" />
</fishes>
</menu>
</example>
In an XML document, to indicate that certain elements, or all of them, do not belong to any namespace, the xmlns attribute is written empty, that is, xmlns="".
<?xml version = "1.0" encoding = "UTF-8"?>
<example xmlns="http://www.abrirllave.com/example1">
<menu>
<option>Save</option>
<description>Save the current document</description>
</menu>
<menu xmlns="http://www.abrirllave.com/example2">
<meats>
<veal_steak price = "12.95" />
<sirloin_staeck price = "13.60" />
</meats>
<fishes xmlns="">
<baked_salmon price = "16.20" />
<hake_in_green_sauce price = "15.85" />
</fishes>
</menu>
</example>
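The effect of each default namespace declaration can be made visible by listing the expanded name of every element. The sketch below assumes Python's standard `xml.etree.ElementTree` module (not used in the original text) and a shortened version of the document above.

```python
import xml.etree.ElementTree as ET

document = """<example xmlns="http://www.abrirllave.com/example1">
  <menu>
    <option>Save</option>
  </menu>
  <menu xmlns="http://www.abrirllave.com/example2">
    <meats>
      <veal_steak price="12.95" />
    </meats>
    <fishes xmlns="">
      <baked_salmon price="16.20" />
    </fishes>
  </menu>
</example>"""

root = ET.fromstring(document)

# iter() visits the root and all its descendants in document order.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

Elements under the first default namespace carry the example1 URI, the second menu and its meats branch carry the example2 URI, and fishes and baked_salmon carry no namespace at all.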
Bibliography, webography and credits¶
- Wikipedia contributors. (2020, September 13). Markup language. In Wikipedia, The Free Encyclopedia. Retrieved 15:51, September 15, 2020, from https://en.wikipedia.org/w/index.php?title=Markup_language&oldid=978142210
- Carlos Pes. (February 2017). Lenguajes de Marcas y Sistemas de Gestión de Información (LMSGI). Available at Tutorial de LMSGI
- Bartolomé Sintés Marco. (June 2020). XML: Lenguaje de marcas extensible. Available at https://www.mclibre.org/consultar/xml/
- Mozilla Contributors (June 2021). XML introduction. Available at https://developer.mozilla.org/en-US/docs/Web/XML/XML_introduction
- IBM Corporation. (June 2020). XML Syntax Rules. Available at https://www.ibm.com/docs/en/scbn?topic=syntax-xml-rules