XML schemas and vocabularies

In unit 1 you learned how to create well-formed XML documents, that is, documents that follow the XML syntax rules.

Important

An XML document is valid when, in addition to being well-formed, it does not break any of the rules established in its structure definition.

The aim of this unit is to learn how to define structural rules so that we can create our own XML dialects, and how those rules determine which XML documents are valid.

This kind of structure can be defined with the following languages:

  • DTD (Document Type Definition).
  • XML Schema.
  • RELAX NG (REgular LAnguage for XML Next Generation).

Look at this simple XML document called "note.xml":

<?xml version="1.0"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note> 

A DTD File

The following example is a DTD file called note.dtd that defines the elements of the XML document above (note.xml):

<!ELEMENT note (to, from, heading, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
The first line defines the note element as having four child elements, in this order: to, from, heading, body.

Lines 2-5 define the to, from, heading, and body elements to be of type #PCDATA (parsed character data, that is, text).
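To link note.xml to this DTD, the document can carry a DOCTYPE declaration that points to the file. The following is a minimal sketch, assuming note.dtd is stored in the same directory as note.xml:

```xml
<?xml version="1.0"?>
<!-- The DOCTYPE names the root element and the external DTD file.
     The relative path "note.dtd" is an assumption for this example. -->
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
```

A validating parser reads the DTD named in the DOCTYPE and checks the document against it.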

An XML Schema

The following example is an XML Schema file called note.xsd that defines the elements of the XML document above (note.xml):

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="note">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="to" type="xs:string"/>
                <xs:element name="from" type="xs:string"/>
                <xs:element name="heading" type="xs:string"/>
                <xs:element name="body" type="xs:string"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
The note element is a complex type because it contains other elements. The other elements (to, from, heading, body) are simple types because they do not contain other elements. You will learn more about simple and complex types in the following chapters.

Document Type Definitions

DTD (Document Type Definition) is a schema definition language that existed before the advent of XML. It was designed to work with SGML and can be used with many of the markup languages based on it, such as XML or HTML.

When XML was defined, a simplified version of DTD was adopted as its original schema specification language.

The main purpose of DTDs is to provide a mechanism for validating the structure of XML documents, that is, for determining whether a document is valid or not. This is not their only advantage: DTDs can also be used to share information between organizations, because anyone who has our DTD can send us information in our format and we can process it.

For a long time DTDs were the most widely used vocabulary definition system in XML, but they have now been overtaken by XML Schema. However, they are still widely used, especially because they are much simpler.

The XML specification refers to DTDs as a method of defining XML vocabularies, but DTDs have a number of limitations that led the W3C to define a new specification. This specification was called the W3C XML Schema Definition Language (popularly called the XML Schema or XSD), and was created to replace the DTD as a vocabulary definition method for XML documents. Furthermore, unlike DTD, XSD is an XML dialect.

For these reasons, we will focus on XML Schema.

XML Schema Definition

The latest specification can be found at www.w3.org/XML/Schema.

The purpose of an XML Schema is to define the legal building blocks of an XML document:

  • the elements and attributes that can appear in a document
  • the number of (and order of) child elements
  • data types for elements and attributes
  • default and fixed values for elements and attributes

The success of XSD has been great, and it is now used for tasks other than simply validating XML. It is also used in other XML technologies such as XQuery, web services, etc.

The most important features that XSD provides are:

  1. It is written in XML, so there is no need to learn a new syntax to define XML schemas.
  2. It has its own data type system, so the content of elements and attributes can be checked.
  3. It supports namespaces, so different vocabularies can be mixed.

<schema> definition

XSD is based on XML and must therefore comply with XML rules:

  • Although not required, the file is usually started with the XML declaration.
  • There is only one root element, which in this case is <schema>.

Because XSD is itself a specific, well-known XML vocabulary, its namespace must always be declared in order to use its elements: http://www.w3.org/2001/XMLSchema.

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified">
    ...
</xs:schema>

xmlns:xs="http://www.w3.org/2001/XMLSchema" indicates that the elements and data types used in the schema come from the "http://www.w3.org/2001/XMLSchema" namespace. It also specifies that the elements and data types that come from the "http://www.w3.org/2001/XMLSchema" namespace should be prefixed with xs:.

elementFormDefault="qualified" indicates that any elements used by the XML instance document which were declared in this schema must be namespace qualified.
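The effect of elementFormDefault is easiest to see when the schema has a target namespace. The following sketch assumes a placeholder namespace, http://example.com/note, which is not part of the original examples:

```xml
<?xml version="1.0"?>
<!-- "http://example.com/note" is a placeholder target namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://example.com/note"
    elementFormDefault="qualified">
    <xs:element name="note">
        <xs:complexType>
            <xs:sequence>
                <!-- Local element: must be namespace qualified in instances -->
                <xs:element name="to" type="xs:string"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```

With this schema, an instance document would have to qualify both the root and the local element:

```xml
<n:note xmlns:n="http://example.com/note">
    <n:to>Tove</n:to>
</n:note>
```

With elementFormDefault="unqualified", only the globally declared note element would be qualified, and to would be written without a prefix.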

XSD tags

XSD defines many tags and not all of them can be seen here. You can find all possible tags in the specification www.w3.org/TR/xmlschema11-1.

The <schema> tag can have different attributes, some of which are shown in the following table:

Attribute | Meaning
attributeFormDefault | The form for attributes declared in the target namespace of this schema. The value must be "qualified" or "unqualified". Default is "unqualified", which indicates that attributes from the target namespace are not required to be qualified with the namespace prefix.
elementFormDefault | The form for elements declared in the target namespace of this schema. The value must be "qualified" or "unqualified". Default is "unqualified", which indicates that elements from the target namespace are not required to be qualified with the namespace prefix.
version | Defines the version of the schema document we are writing (not the version of XML Schema).

From the root element you can start defining the tags of the vocabulary you want to create.

Associate a schema to an XML document

Unlike DTDs, where the association is usually declared in the XML document itself, you do not need to modify the XML file to validate it against an XSD. However, the XML document can also point to its schema.

To associate an XML document with a schema document, declare the XMLSchema-instance namespace (conventionally with the xsi prefix) in the root element and use one of its attributes to give the location of the schema file:

<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="urlset.xsd">
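The schema file can be referenced in two ways, shown in the following table.

Attribute | Meaning
noNamespaceSchemaLocation | Gives the location of a schema that has no target namespace; the elements of the document are not in any namespace.
schemaLocation | Used when the elements of the document belong to a namespace. Its value is a list of pairs: a namespace name followed by the location of the schema for that namespace.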
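When the elements of the document belong to a namespace, schemaLocation pairs that namespace with the schema file. A minimal sketch, in which the namespace http://example.com/urlset is a placeholder chosen for illustration:

```xml
<!-- The namespace name and the file name are placeholders.
     schemaLocation holds pairs: namespace name, then schema location. -->
<urlset xmlns="http://example.com/urlset"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://example.com/urlset urlset.xsd">
    ...
</urlset>
```

The namespace in the first half of the pair should match the targetNamespace declared in the schema file.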

Validation from command line

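You can validate from the command line with the xmllint command. The --schema option takes the schema file, and --noout suppresses the dump of the parsed document so that only the validation result is printed.

xmllint --noout --schema doc.xsd doc.xml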

Elements definition

Elements are defined using the <element> tag and the name attribute. Optionally, the type attribute gives the data type of the element's content. In the following example we define an element named firstname of type string.

<xs:element name="firstname" type="xs:string" />
This element complies with the definition:

<firstname>Homer</firstname>
XSD divides elements into two large groups based on the data they contain:

  • Elements with simple type content: Elements without attributes that only contain data.
  • Elements with complex type content: Elements that may have attributes, no content, or contain elements.

From the definition, it can be seen that there will almost always be some complex type, as the root will usually contain other elements.

Elements with simple type content

Info

Elements with simple type content are those that do not contain other elements or have attributes.

XSD version 1.1 defines about fifty different data types, which can be found in its specification www.w3.org/TR/xmlschema11-2. Among the most used are those in the following table:

Type | Data that can be stored
string | Character strings
decimal | Decimal numbers
boolean | Only true, false, 1, or 0
date | Dates in the form YYYY-MM-DD
anyURI | Resource references (URLs, file paths…)
base64Binary | Binary data encoded in base64
integer | Integers

In addition to the basic types, the standard provides others, many of them derived from the basic ones, so that schema designers have data types better adapted to their needs: positiveInteger, nonNegativeInteger, gYearMonth, unsignedInt, and so on.
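As an illustration, the following declarations use some of these more specific types. The element names are hypothetical and only serve the example:

```xml
<!-- Hypothetical element names; the types are built into XSD -->
<xs:element name="quantity" type="xs:positiveInteger" />      <!-- 1, 2, 3, ... -->
<xs:element name="stock" type="xs:nonNegativeInteger" />      <!-- 0, 1, 2, ... -->
<xs:element name="billingPeriod" type="xs:gYearMonth" />      <!-- e.g. 2011-09 -->
```

With these declarations, <quantity>0</quantity> would be rejected, while <stock>0</stock> would be accepted.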

Data types allow you to restrict the values that XML elements can contain. For example, if you start from the following definition:

...
    <xs:element name="position" type="xs:integer" />
</xs:schema>
An element validates only if its content is an integer. For example, the following will not validate:

<position>First</position>

In the following table you can see examples of element definitions and values that validate against them.

Definition | Example
<xs:element name="dia" type="xs:date" /> | <dia>2011-09-15</dia>
<xs:element name="height" type="xs:integer" /> | <height>220</height>
<xs:element name="name" type="xs:string" /> | <name>Pere Puig</name>
<xs:element name="size" type="xs:float" /> | <size>1.7E2</size>
<xs:element name="place" type="xs:anyURI" /> | <place>http://www.ioc.cat</place>

Cardinality

By default, an element declared in XSD must appear exactly once. It is quite common, however, for tags to be repeated a certain number of times. XSD handles this with two attributes of the <element> tag that determine the cardinality of elements:

  • minOccurs: defines the minimum number of times an element must appear. A value of 0 indicates that the element is optional.
  • maxOccurs: defines the maximum number of times an element can appear. The value unbounded means that there is no limit.

Using these attributes, you can declare that the <firstname> element appears exactly once and the <surname> element once or twice.

<xs:element name="firstname" />
<xs:element name="surname" maxOccurs="2" />
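The two attributes can be combined. The sketch below uses hypothetical element names (middlename, phone) to show an optional element and an element that can be repeated without limit:

```xml
<xs:sequence>
    <!-- No cardinality attributes: exactly once -->
    <xs:element name="firstname" type="xs:string" />
    <!-- Optional: zero or one occurrence -->
    <xs:element name="middlename" type="xs:string" minOccurs="0" />
    <!-- Zero or more occurrences -->
    <xs:element name="phone" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
```

Note that minOccurs and maxOccurs are only allowed on element declarations nested inside a content model such as <xs:sequence>, not on elements declared directly under <xs:schema>.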

Fixed and default values

Values can also be controlled in element declarations with the fixed, default, and nillable attributes.

The fixed attribute allows you to define a required value for an element:

<xs:element name="centre" type="xs:string" fixed="IOC" />
So the element can only contain the specified value, or be empty:

<centre />
<centre>IOC</centre>

But never a different value than specified:

<!-- validation error -->
<centre> Institut Cendrassos </centre>
Unlike fixed, default assigns a default value but lets it be changed in the content of the element.

<xs:element name="centre" type="xs:string" default="IOC" />

The definition validates in the following three cases:

<centre />
<centre>IOC</centre>
<centre>Institut Cendrassos</centre>

The nillable attribute indicates whether the element may be explicitly marked as having no value (with xsi:nil="true" in the XML document). It can only take the values true or false.
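In XSD the attribute that controls this is spelled nillable and takes true or false; in the instance document the element is then marked with xsi:nil. A minimal sketch, using a hypothetical enddate element:

```xml
<!-- Schema: enddate is a hypothetical element that may be marked as nil -->
<xs:element name="enddate" type="xs:date" nillable="true" />
```

An instance could then state explicitly that there is no value:

```xml
<enddate xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true" />
```

Without nillable="true", an empty <enddate /> would not validate, because an empty string is not a valid xs:date.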

Simple custom types

Sometimes it is useful to constrain the values of an element beyond what the standard types allow. XSD lets you define custom data types, for example when you want a numeric value but only accept a subset of the integers.

To define custom simple types, the type is not placed in the element and a <simpleType> child is defined.

<xs:element name="person">
    <xs:simpleType>
    ...
    </xs:simpleType>
</xs:element>

simpleType specifies the modification you want to make. The changes are made with list, union, or restriction (extension is not available inside simpleType; it is used in simpleContent and complexContent).

Lists

Although lists of values can be defined, their use is often discouraged; a common recommendation is to represent the values as repeated elements instead.

Using list, you can declare that an element contains a whitespace-separated list of values. For example, an element that contains a list of dates would be defined as:

<xs:element name="matches">
    <xs:simpleType>
        <xs:list itemType="xs:date" />
    </xs:simpleType>
</xs:element>

The element would validate with something like:

<matches>2011-01-07 2011-01-15 2011-01-21 </matches>
simpleType elements can also be defined with a name outside the elements and then used as a custom data type.

<xs:simpleType name="days">
    <xs:list itemType="xs:date" />
</xs:simpleType>

<xs:element name="matches" type="days" />

Union

Using named custom types, union types can be created. A union allows the content of an element to be of any one of several types.

The following definition of the <price> element allows its content to be either of type value or of type symbol.

<xs:element name="price">
    <xs:simpleType>
        <xs:union memberTypes="value symbol" />
    </xs:simpleType>
</xs:element>

With this we could assign values like these:

<price>25 </price>
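The union refers to two named types, value and symbol, whose definitions are not shown. One possible pair of definitions is sketched below; the restrictions and the enumerated words are assumptions made for illustration:

```xml
<!-- Hypothetical member types for the union -->
<xs:simpleType name="value">
    <xs:restriction base="xs:decimal">
        <xs:minInclusive value="0" />
    </xs:restriction>
</xs:simpleType>

<xs:simpleType name="symbol">
    <xs:restriction base="xs:string">
        <xs:enumeration value="free" />
        <xs:enumeration value="unknown" />
    </xs:restriction>
</xs:simpleType>
```

Under these assumptions, both <price>25</price> and <price>free</price> would validate, while <price>cheap</price> would not, because it matches neither member type.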

Restrictions

Without a doubt the most interesting modifier is the one that lets you define restrictions on the base types. With the restriction modifier you can create data types in which only certain values are accepted, the data meets a certain condition, and so on.

The <birth> element can only have integer values between 1850 and 2011 if defined as follows:

<xs:element name="birth">
    <xs:simpleType>
        <xs:restriction base="xs:integer">
            <xs:maxInclusive value="2011" />
            <xs:minInclusive value="1850" />
        </xs:restriction>
    </xs:simpleType>
</xs:element>
We can also define a simple type for later use:

<xs:simpleType name="year_birth">
    <xs:restriction base="xs:integer">
        <xs:maxInclusive value="2011" />
        <xs:minInclusive value="1850" />
    </xs:restriction>
</xs:simpleType>

<xs:element name="birth" type="year_birth" />

Many kinds of restrictions can be defined by means of child elements of restriction, called facets (see the following table). Normally the value of the constraint is given in the value attribute:

Element | Result
maxInclusive/maxExclusive | Define the maximum value that an element can take, included or excluded.
minInclusive/minExclusive | Define the minimum value that an element can take, included or excluded.
length | Restricts the exact length of a text value. <xs:minLength> and <xs:maxLength> can be used to give a range instead.
enumeration | Only allows the element to have one of the values listed, one per enumeration element.
totalDigits | Defines the maximum number of digits of a numeric value.
fractionDigits | Specifies the maximum number of decimal places that a numeric value can have.
pattern | Defines a regular expression that the value of the element must match in order to be valid.
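As a sketch of how some of these facets are written, the following two types use length, totalDigits, and fractionDigits. The type names postcode and amount are hypothetical:

```xml
<!-- Hypothetical type: text of exactly 5 characters -->
<xs:simpleType name="postcode">
    <xs:restriction base="xs:string">
        <xs:length value="5" />
    </xs:restriction>
</xs:simpleType>

<!-- Hypothetical type: at most 6 digits in total, at most 2 of them decimals -->
<xs:simpleType name="amount">
    <xs:restriction base="xs:decimal">
        <xs:totalDigits value="6" />
        <xs:fractionDigits value="2" />
    </xs:restriction>
</xs:simpleType>
```

With these definitions, 1234.56 is a valid amount, while 1234.567 is not, because it has three decimal places.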

For example, the value of the <answer> element can only be one of the three values "A", "B" or "C" if it is defined in this way:

<xs:element name="answer">
    <xs:simpleType>
        <!-- enumeration facets must be placed inside a restriction -->
        <xs:restriction base="xs:string">
            <xs:enumeration value="A" />
            <xs:enumeration value="B" />
            <xs:enumeration value="C" />
        </xs:restriction>
    </xs:simpleType>
</xs:element>

One of the most interesting constraints is the pattern facet, which lets you define constraints with regular expressions. As a general rule, a literal character in the pattern must appear in the content; the other possibilities are shown in the table:

Symbol | Meaning
. | Any character (except a line break)
\d | Any digit
\D | Any non-digit character
\s | Whitespace characters: spaces, tabs, line breaks…
\S | Any non-whitespace character
x* | x must appear 0 or more times
x+ | x must appear 1 or more times
x? | x may appear once or not at all
[abc] | Any one of the listed characters
[0-9] | Any one character in the range between the two specified, inclusive
x{5} | The x expression must appear 5 times.
x{5,} | The x expression must appear 5 or more times.
x{5,8} | The x expression must appear from 5 to 8 times.

Using this system you can define highly customized data types. For example, we can require a value to have the form of a DNI (8 digits, a hyphen, and a capital letter) with this expression:

<xs:simpleType name="dni">
    <xs:restriction base="xs:string">
        <xs:pattern value="[0-9]{8}-[A-Z]" />
    </xs:restriction>
</xs:simpleType>
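The named type can then be used like any built-in type. In this sketch the element name id is hypothetical:

```xml
<!-- Hypothetical element that uses the dni type -->
<xs:element name="id" type="dni" />

<!-- Validates: 8 digits, a hyphen, a capital letter -->
<!-- <id>12345678-A</id> -->

<!-- Do not validate: only 7 digits / lowercase letter -->
<!-- <id>1234567-A</id> -->
<!-- <id>12345678-a</id> -->
```

XSD patterns always have to match the whole value, so there is no need to add anchors such as ^ and $ to the expression.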

More information

In Quick-Start: Regex Cheat Sheet you will find a quick guide to Regular Expressions and a lot of related resources.

Elements with complex type content

Info

Elements with complex type content are those that have attributes, contain other elements, or have no content.

Elements with complex content have received a lot of criticism because they are considered too complicated, but they cannot be avoided: almost every schema file contains at least one complex type, the root of the document.

Four major groups of complex content are usually distinguished:

  • Elements whose content is only data. They are like simple type elements, but with attributes.
  • Elements that contain only other elements.
  • Empty elements.
  • Elements with mixed content.

Elements of complex type are defined by placing an <xs:complexType> child in the element declaration.

<xs:element name="class">
    <xs:complexType>
          ....
    </xs:complexType>
</xs:element>

As with simple types, named complex types can be defined for reuse as custom types.

<xs:complexType name="course">
 ...
</xs:complexType>

<xs:element name="class" type="course"/>

Content made up of elements

Elements that contain other elements are defined in XSD within a <complexType>, using one of the elements in the following table:

Label | Used for
sequence | Specify the content as an ordered list of elements.
choice | Specify alternative elements.
all | Define the content as an unordered list of elements.
complexContent | Extend or restrict complex content.

Sequence

The <sequence> element allows you to specify the elements that should be part of an element's content. Even in the case where there is only a single tag it can be defined as a sequence.

Its most important condition is that the elements in the XML document must appear in the same order in which they are declared in the sequence.

<xs:element name="person">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="name" type="xs:string" />
            <xs:element name="surname" type="xs:string" maxOccurs="2" />
            <xs:element name="type" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

The example above defines that one or two surnames must appear after <name> and before <type>.

<person>
    <name>Marcel</name>
    <surname>Puig</surname>
    <surname>Lozano</surname>
    <type>Professor</type>
</person>

A document does not validate if any element is out of order.

<person>
    <type>Professor</type>
    <surname>Puig</surname>
    <name>Marcel</name>
</person>

Sequences may contain other sequences of elements.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="fullname">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string" />
              <xs:element name="surname" type="xs:string" maxOccurs="2" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="profession" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
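A document that follows this nested definition could look like the sketch below (the data is illustrative). The schema has no target namespace, so the elements are written without a prefix:

```xml
<person>
    <!-- Inner sequence: name, then one or two surnames -->
    <fullname>
        <name>Marcel</name>
        <surname>Puig</surname>
        <surname>Lozano</surname>
    </fullname>
    <!-- Outer sequence continues after fullname -->
    <profession>Professor</profession>
</person>
```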

All

The most important difference between the <all> and <sequence> elements is the order. The <all> element lets you specify a set of elements that may appear in any order.

Therefore, if we define the <person> element as follows:

<xs:element name="person">
    <xs:complexType>
        <xs:all>
            <xs:element name="name" />
            <xs:element name="surname" />
        </xs:all>
    </xs:complexType>
</xs:element>

It validates both this document:

<person>
    <name>Pere</name>
    <surname>Garcia</surname>
</person>

and this one:

<person>
    <surname>Garcia</surname>
    <name>Pere</name>
</person>

But the limitations of this element, which ordered sequences do not have, must always be taken into account:

  • It can only contain elements. It cannot contain sequences or alternatives.
  • In XSD 1.0, the elements it contains can appear at most once (maxOccurs greater than 1 is not allowed), as repetition would cause a problem of non-determinism.

Therefore, the following example is incorrect, because it asks for <surname> to be able to appear twice.

<xs:all>
    <xs:element name="name" type="xs:string" />
    <xs:element name="surname" maxOccurs="2" type="xs:string" />
</xs:all>

One possible way to allow the first and last names to be specified in any order would be to do the following:

<xs:complexType>
    <xs:choice>
        <xs:sequence>
            <xs:element name="name" type="xs:string" />
            <xs:element name="surname" type="xs:string" maxOccurs="2" />
        </xs:sequence>
        <xs:sequence>
            <xs:element name="surname" type="xs:string" maxOccurs="2" />
            <xs:element name="name" type="xs:string" />
        </xs:sequence>
    </xs:choice>
</xs:complexType>

Choice

The <choice> element is used to choose one of the alternatives presented.

In this example, the person type may contain either the <nomCognoms> or the <dni> tag, but not both.

<xs:complexType name="person">
    <xs:choice>
        <xs:element name="nomCognoms" type="xs:string" />
        <xs:element name="dni" type="xs:string" />
    </xs:choice>
    ...

Alternatives may include sequences or other <choice> elements. The following definition is more elaborate than the previous one and lets you choose between the <name> and <surname> elements, or <dni>.

<xs:choice>
    <xs:sequence>
        <xs:element name="name" type="xs:string" />
        <xs:element name="surname" type="xs:string" maxOccurs="2" />
    </xs:sequence>
    <xs:element name="dni" type="xs:string" />
</xs:choice>
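To make the two alternatives concrete, the following sketch assumes that this <choice> is the content model of a hypothetical <person> element. Each of the two documents would validate on its own:

```xml
<!-- First alternative: name followed by one or two surnames -->
<person>
    <name>Pere</name>
    <surname>Garcia</surname>
</person>

<!-- Second alternative: only the dni -->
<person>
    <dni>12345678-A</dni>
</person>
```

A <person> that contained both <name> and <dni> would be rejected, because only one branch of the choice can be used.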

complexContent

The complexContent tag lets you define extensions or restrictions of a complex type that contains mixed content or only elements.

This makes it possible to extend an existing complex type with an extension, or to restrict its contents.

For example, if you have already defined a fullname data type containing the <name> and <surname> elements, you can reuse that definition to define a new data type, agenda (an address book entry), which adds the e-mail address.

<xs:complexType name="fullname">
    <xs:sequence>
        <xs:element name="name" type="xs:string" />
        <xs:element name="surname" type="xs:string" maxOccurs="2" />
    </xs:sequence>
</xs:complexType>

<xs:complexType name="agenda">
    <xs:complexContent>
        <xs:extension base="fullname">
            <xs:sequence>
                <xs:element name="email" type="xs:string" />
            </xs:sequence>
        </xs:extension>
    </xs:complexContent>
</xs:complexType>

In this way, an element of type agenda can be defined:

<xs:element name="person" type="agenda" />

which must contain the three elements <name>, <surname>, and <email>:

<person>
    <name>Pere</name>
    <surname>Garcia</surname>
    <email>pgarcia@ioc.cat</email>
</person>

Attributes

A basic feature of XSD is that only complex type elements can have attributes. In essence, there is not much difference between defining an element or an attribute, as it is done in the same way but using the attribute tag.

Attributes use the same simple data types as elements, so they can have basic types, as in the following example:

<xs:attribute name="number" type="xs:integer" />
Restrictions can be placed in the same way as in the elements. In this example, the year attribute cannot have values greater than 2021 if it is defined as follows:

<xs:attribute name="year">
    <xs:simpleType>
        <xs:restriction base="xs:integer">
            <xs:maxInclusive value="2021" />
        </xs:restriction>
    </xs:simpleType>
</xs:attribute>
Unless otherwise specified, attributes are always optional.

The <attribute> tag has a series of attributes that let you define extra features of the attributes.

Attribute | Use
use | Specifies whether the attribute is required, optional, or prohibited.
default | Sets a default value.
fixed | Used to define required values for attributes.
form | Defines whether the attribute must be qualified with the namespace prefix (qualified) or not (unqualified).

For example, the year attribute must be specified if it is defined as follows:

<xs:attribute name="year" type="xs:integer" use="required" />
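Attribute declarations are placed inside the complexType of the element they belong to, after its content model. The sketch below combines use, default, and fixed on a hypothetical book element:

```xml
<!-- Hypothetical element used only to illustrate attribute options -->
<xs:element name="book">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="title" type="xs:string" />
        </xs:sequence>
        <!-- Attributes are declared after the content model -->
        <xs:attribute name="year" type="xs:integer" use="required" />
        <xs:attribute name="lang" type="xs:language" default="en" />
        <xs:attribute name="format" type="xs:string" fixed="paperback" />
    </xs:complexType>
</xs:element>
```

With this definition, <book year="2011"><title>XML</title></book> would validate: lang takes its default value and format, if written, could only be "paperback".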

Text-only elements with attributes

A complex text-only element contains text and may have attributes, but no child elements. In this case, the content of complexType is a simpleContent, which lets you define restrictions or extensions for elements that only have data as content.

The most important difference from simple types is that here you can define attributes on the element. Attributes are added by defining an extension of the type used for the content.

xs:extension is used to extend a simple type (inside simpleContent) or a complex type (inside complexContent).

In this example, the <shoesize> element has integer content and defines one attribute, country, which is a string.

<xs:element name="shoesize">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:integer">
        <xs:attribute name="country" type="xs:string" />
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element> 
For instance:

<shoesize country="france">35</shoesize>

Empty elements

In XSD, elements without content are always of complex type. The definition simply does not specify any content, which gives an empty element.

<xs:element name="delegate">
    <xs:complexType />
</xs:element>

The definition allows you to define the item as follows:

<delegate />

If the element needs attributes they are simply specified within the complexType.

<xs:element name="delegate">
    <xs:complexType>
        <xs:attribute name="year" use="required" type="xs:gYear" />
    </xs:complexType>
</xs:element>

And you can now define the attribute in the empty element:

<delegate year="2012" />

Mixed content

Mixed content elements are elements that contain both child elements and text. They were designed for including elements in the middle of narrative text.

In XSD, mixed content is defined by putting the mixed="true" attribute on the complexType of the element.

<xs:element name="letter">
    <xs:complexType mixed="true">
        <xs:sequence>
            <xs:element name="name" type="xs:string" />
            <!-- xs:gDay would require the form ---12, so a plain number type is used -->
            <xs:element name="day" type="xs:positiveInteger" />
        </xs:sequence>
    </xs:complexType>
</xs:element>
This would allow us to validate content like this:

<letter>Dear Sir <name>Peter</name>:
    I am sending you this letter to remind you that we have arranged to
    meet on the <day>12</day>
</letter>

Example of creating an XSD

XSD vocabulary definitions can be created from an idea of what we want the data to contain or from a sample XML file.

Practical case

We want to store in an XML document some website bookmarks:

<?xml version="1.0" encoding="UTF-8"?>
<bookmarks>
    <website>
        <name>Abrirllave</name>
        <description>Tutoriales de informática.</description>
        <url>http://www.abrirllave.com/</url>
    </website>
    <website>
        <name>Wikipedia</name>
        <description>La enciclopedia libre.</description>
        <url>http://www.wikipedia.org/</url>
    </website>
    <website>
        <name>W3C</name>
        <description>World Wide Web Consortium.</description>
        <url>http://www.w3.org/</url>
    </website>
</bookmarks>
We need to create an XSD file to validate this XML document.

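The first step is to decide which kinds of elements must be created. As the root element contains other elements, we have to define it as a complex type. The same applies to <website>, while <name>, <description>, and <url> only contain data and can use simple types.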

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="bookmarks">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="website" minOccurs="1" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>            
              <xs:element name="name" type="xs:string"/>
              <xs:element name="description" type="xs:string"/>
              <xs:element name="url" type="xs:anyURI"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
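If the schema is saved next to the XML document, the document can point to it from its root element. The file name bookmarks.xsd is an assumption made for this sketch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- "bookmarks.xsd" is an assumed file name for the schema above -->
<bookmarks xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="bookmarks.xsd">
    <website>
        <name>W3C</name>
        <description>World Wide Web Consortium.</description>
        <url>http://www.w3.org/</url>
    </website>
</bookmarks>
```

noNamespaceSchemaLocation is the appropriate attribute here because the schema declares no target namespace.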

Credits and bibliography