Chapter 9 The Config4* Schema Language

9.1 Introduction

A schema is a blueprint or definition of a system. For example, a database schema defines the layout of a database: its tables, the columns within those tables, and so on. It is common (but not a requirement) for a schema to be written in the same syntax as the system it defines. For example, a database’s schema might be stored within a table of the database itself.

Another technology that uses schemas is XML. The first schema language for XML was called document type definition (DTD). Many people felt DTD was sufficient to define schemas for text-oriented XML documents, which tend to have a simple structure, but not flexible enough to define schemas for more structured, data-oriented XML documents. Because of this, several competing XML schema languages were defined, including XML Schema and RELAX NG.

By itself, a schema is not very useful; you also need a piece of software, called a schema validator, that can compare a system (database, XML file or whatever) against the system’s schema definition and report errors. Within the Config4* library is a class called SchemaValidator that, as its name suggests, implements a schema validator. Application developers can use this to automate useful validity checks on configuration information.

As you read this chapter, you may wish to test some of the examples to ensure you fully understand the semantics of the schema language. You do not need to write any code to do such testing. Instead, you can use the validate command of the config4cpp or config4j utilities to test a schema. Section 7.3 discusses how to use that command.

9.2 Syntax

A Config4* schema consists of an array of strings. The grammar for a string within a schema is shown in Figure 9.1. As can be seen, a string within a schema can be one of the following: an identifier rule, an ignore rule or a type definition.

Figure 9.1: Formal grammar of the Config4* schema language

Notation: | denotes choice, and {...}* denotes 0 or more repetitions.

stringInSchema    = idRule | ignoreRule | TypeDefinition

idRule            = OptOrReq Name ’=’ TypeName OptArgList

OptOrReq          = empty | ’@optional’ | ’@required’

ignoreRule        = ’@ignoreScopesIn’     Name
                  | ’@ignoreVariablesIn’  Name
                  | ’@ignoreEverythingIn’ Name

TypeDefinition    = ’@typedef’ TypeName ’=’ TypeName OptArgList

OptArgList        = empty | ’[’ ArgList ’]’
ArgList           = empty | Arg { ’,’ Arg }*
Arg               = IDENTIFIER | STRING

Name              = IDENTIFIER
TypeName          = IDENTIFIER

9.2.1 Identifier Rules

An identifier rule specifies the permissible type of an item of configuration information. This can be illustrated with an example. Let’s assume you are writing an application that requires configuration information like that shown in Figure 9.2. A suitable schema for this configuration is shown (in Java syntax) in Figure 9.3. Each string in that schema is an identifier rule; it specifies the permitted type for a named item of configuration information. For example, the first rule in Figure 9.3 specifies that timeout is of type durationMilliseconds, and the fourth rule specifies that log is a scope.

Figure 9.2: Example configuration for an application

timeout = "2 minutes";
fonts = ["Times Roman", "Helvetica", "Courier"];
background_colour = "white";
log {
    dir = "C:\foo\logs";
    level = "1";
}

Figure 9.3: Schema for the example configuration shown in Figure 9.2

String[] schema = new String[] {
    "timeout = durationMilliseconds",
    "fonts = list[string]",
    "background_colour = enum[grey, white, yellow]",
    "log = scope",
    "log.dir = string",
    "log.level = int[0, 3]"
};

The simplest form of an identifier rule is name=type. The type can be optionally followed by a list of arguments. The use of arguments is illustrated by the rules for fonts, log.level and background_colour in Figure 9.3. In the rule for fonts, the argument specifies that each item in the fonts list should be interpreted as a string, rather than, say, a boolean or int. In the rule for log.level, the arguments specify minimum and maximum values for the integer. In the rule for background_colour, the enum type specifies an enumeration of allowable values, which is indicated by its list of arguments.

9.2.2 The @optional and @required Keywords

You can optionally use one of the keywords @optional or @required at the start of an identifier rule. For example:

String[] schema = new String[] {
    "x = string", // defaults to @optional
    "@optional y = string",
    "@required z = string"
};

If you do not specify one of those keywords, then the default behaviour is as if you had specified @optional.

The semantics of @required are that the specified entry must be present in the configuration scope being validated. Conversely, the semantics of @optional are that the specified entry may be (but is not required to be) in the scope being validated.

The default semantics of entries being optional means that a schema works well with both fallback configuration (Section 3.6.3) and default parameters passed to lookup-style operations (Section 3.7).

The semantics of "uid-" entries are that they may appear zero or more times. Because of this, "uid-" entries are intrinsically optional. If you try to use @required with a "uid-" entry, then the schema validator throws an exception.

9.2.3 Defining a New Type

A type definition defines a new type in terms of an existing type. As an example of this, Figure 9.4 shows a revised schema for the configuration previously shown in Figure 9.2. This revised schema defines two new types, colour and logLevel, and then uses them to specify the types of the background_colour and log.level variables.

Figure 9.4: Alternative schema for the example configuration shown in Figure 9.2

String[] schema = new String[] {
    "@typedef colour = enum[grey, white, yellow]",
    "@typedef logLevel = int[0, 3]",
    "timeout = durationMilliseconds",
    "fonts = list[string]",
    "background_colour = colour",
    "log = scope",
    "log.dir = string",
    "log.level = logLevel"
};

The ability to define new types serves two purposes. First, it helps to ensure consistency if you need to use a type—such as colour or logLevel—for several variables in a schema. Second, and more importantly, it enables you to work around a limitation in the syntax of the schema language. To understand this, let’s assume the colour_list variable in a configuration file specifies a list of colours. You cannot specify this in a schema with the following:

String[] schema = new String[] {
    "colour_list = list[enum[grey, white, yellow]]"
};

This is because the schema syntax does not permit the nesting of argument lists. You can work around this syntactic limitation with the aid of a @typedef statement, as shown below:

String[] schema = new String[] {
    "@typedef colour = enum[grey, white, yellow]",
    "colour_list = list[colour]"
};
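
One way to picture what a @typedef does is as an entry in a lookup table that maps a new type name to its definition. The following Java sketch illustrates that idea. It is not the Config4* implementation; the class TypedefSketch and its methods are hypothetical names that I use purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; the names here are hypothetical, not Config4* API. */
final class TypedefSketch {
    private final Map<String, String> typedefs = new HashMap<>();

    /** Records "@typedef name = definition". */
    void define(String name, String definition) {
        typedefs.put(name, definition);
    }

    /** Follows typedef names until reaching a definition that is not a typedef name. */
    String resolve(String typeName) {
        String current = typeName;
        // Bounded loop so that a circular definition cannot spin forever.
        for (int i = 0; i <= typedefs.size(); i++) {
            String next = typedefs.get(current);
            if (next == null) return current;
            current = next;
        }
        throw new IllegalStateException("circular typedef: " + typeName);
    }

    /** Extracts the single argument of a "list[...]" type, for example "colour". */
    static String listElementType(String listType) {
        if (!listType.startsWith("list[") || !listType.endsWith("]")) {
            throw new IllegalArgumentException("not a list type: " + listType);
        }
        return listType.substring("list[".length(), listType.length() - 1).trim();
    }

    public static void main(String[] args) {
        TypedefSketch table = new TypedefSketch();
        table.define("colour", "enum[grey, white, yellow]");
        String element = listElementType("list[colour]");
        System.out.println(element + " -> " + table.resolve(element));
    }
}
```

The point of the sketch is that list[colour] contains only a simple name as its argument, so no nested argument list ever has to be parsed; the nesting is expressed through the lookup instead.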

9.2.4 Available Schema Types

A complete list of the built-in schema types is shown in Table 9.1.

Table 9.1: Built-in schema types

Type                            Explanation
boolean                         "true" or "false"
durationMicroseconds*           A duration of time
durationMilliseconds*           A duration of time
durationSeconds*                A duration of time
enum[name1, ...]                An enumeration of the specified names
float*                          A decimal number
float_with_units[units1, ...]   "<float> <units>"
int*                            An integer number
int_with_units[units1, ...]     "<int> <units>"
list[type]                      A list of the specified type
memorySizeBytes*                A memory size whose units are one of: byte, bytes, KB, MB or GB
memorySizeKB*                   A memory size whose units are one of: KB, MB, GB or TB
memorySizeMB*                   A memory size whose units are one of: MB, GB, TB or PB
scope                           A scope
string*                         A string
table[type1, name1, ...]        A table containing columns of the specified types and names
tuple[type1, name1, ...]        A tuple containing named entries of the specified types
units_with_float[units1, ...]   "<units> <float>"
units_with_int[units1, ...]     "<units> <int>"

* This type can take an optional [min, max] pair of arguments.

One of the types in Table 9.1 is scope, which, as its name suggests, indicates that an entry in a configuration file is a scope. All the remaining types in the table fall into two categories: string-based types and list-based types. I discuss those in the following subsections.

9.2.4.1 String-based Types

The boolean type is a string in which only the values "true" and "false" are valid. The boolean type does not take any arguments.

The int and float types are strings that can be parsed as integer and floating-point numbers. By default, these types have no limit on the range of acceptable values. However, both types can take a pair of arguments that specify a minimum and maximum range of acceptable values. For example, int[0, 5] requires an integer in the range zero to five.

By default, the string type does not place any restriction on the length of a string value. However, string can take a pair of arguments that specify a minimum and maximum length for a string. For example, string[2,5] requires a string between two and five characters long.
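
The checks implied by int[min, max] and string[min, max] can be sketched in a few lines of Java. This is an illustration rather than the library's code: the class RangeChecks is hypothetical, and I am assuming that both bounds are inclusive, which is how I read the examples above.

```java
/** Illustrative sketch only; not part of Config4*. Bounds are assumed to be inclusive. */
final class RangeChecks {
    /** True if value parses as an integer within [min, max], as for int[min, max]. */
    static boolean isIntInRange(String value, long min, long max) {
        try {
            long n = Long.parseLong(value.trim());
            return n >= min && n <= max;
        } catch (NumberFormatException e) {
            return false; // not an integer at all
        }
    }

    /** True if the length of value lies within [minLen, maxLen], as for string[min, max]. */
    static boolean isLengthInRange(String value, int minLen, int maxLen) {
        return value.length() >= minLen && value.length() <= maxLen;
    }

    public static void main(String[] args) {
        System.out.println(isIntInRange("1", 0, 5));        // true
        System.out.println(isIntInRange("7", 0, 5));        // false
        System.out.println(isLengthInRange("abc", 2, 5));   // true
    }
}
```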

The enum type requires one or more arguments. The arguments denote valid values for the enum type. For example:

String[] schema = new String[] {
    "@typedef colour = enum[grey, white, yellow]",
    ...
};

The int_with_units type requires one or more arguments, which specify an enumeration of allowable units. For example, to accept temperature values in the forms "27 Celsius" and "81 Fahrenheit", you could define a temperature type as follows:

String[] schema = new String[] {
    "@typedef temperature = int_with_units[Celsius, Fahrenheit]",
    ...
};

As you might expect, the float_with_units type is similar to the int_with_units type, except that the numeric value can be a floating-point number instead of an integer.

The int_with_units and float_with_units types are ideal if the unit is specified after the numeric value. If the unit is specified before the numeric value then you should use the units_with_int or units_with_float type instead. For example:

String[] schema = new String[] {
    "@typedef money = units_with_float[EUR, GBP, USD]",
    ...
};

Later, in Section 9.2.5, I will explain how to define a schema so that currency symbols (such as €, £ and $) can be used instead of currency names.
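
The difference between the "number then units" and "units then number" forms can be sketched with two regular expressions. The Java code below is an illustration only: the class UnitsSketch is hypothetical, the float pattern is deliberately simplified (no exponents), and I am assuming that whitespace between the number and the units is optional.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Config4* implementation. */
final class UnitsSketch {
    // "<int> <units>": an integer, optional whitespace, then the units.
    private static final Pattern INT_THEN_UNITS =
        Pattern.compile("([+-]?\\d+)\\s*(\\D.*)");
    // "<units> <float>": the units, optional whitespace, then a simple decimal number.
    private static final Pattern UNITS_THEN_FLOAT =
        Pattern.compile("(\\D+?)\\s*([+-]?\\d+(?:\\.\\d+)?)");

    /** Checks a value against int_with_units[allowed...]. */
    static boolean isIntWithUnits(String value, List<String> allowed) {
        Matcher m = INT_THEN_UNITS.matcher(value.trim());
        return m.matches() && allowed.contains(m.group(2).trim());
    }

    /** Checks a value against units_with_float[allowed...]. */
    static boolean isUnitsWithFloat(String value, List<String> allowed) {
        Matcher m = UNITS_THEN_FLOAT.matcher(value.trim());
        return m.matches() && allowed.contains(m.group(1).trim());
    }

    public static void main(String[] args) {
        List<String> temperature = Arrays.asList("Celsius", "Fahrenheit");
        List<String> money = Arrays.asList("EUR", "GBP", "USD");
        System.out.println(isIntWithUnits("27 Celsius", temperature));  // true
        System.out.println(isIntWithUnits("27 Kelvin", temperature));   // false
        System.out.println(isUnitsWithFloat("EUR 19.99", money));       // true
        System.out.println(isUnitsWithFloat("19.99 EUR", money));       // false
    }
}
```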

The memorySizeBytes, memorySizeKB and memorySizeMB types are built on top of float_with_units. The acceptable units you can use with memorySizeBytes are byte, bytes, KB, MB and GB. The acceptable units for memorySizeKB are KB, MB, GB and TB. And the acceptable units for memorySizeMB are MB, GB, TB and PB. The memory-size types can take a pair of arguments that specify minimum and maximum sizes, but a discussion of this is deferred until Section 9.2.5.

The duration types (durationMicroseconds, durationMilliseconds and durationSeconds) are built on top of float_with_units, but they also accept the value "infinite". The acceptable units for use with the duration types are as follows:

durationMicroseconds: microsecond, millisecond, second and minute.
durationMilliseconds: millisecond, second, minute, hour, day and week.
durationSeconds: second, minute, hour, day and week.

You can also specify the plural forms of duration units, for example, milliseconds instead of millisecond.

The duration types can take a pair of arguments that specify minimum and maximum durations, but a discussion of this is deferred until Section 9.2.5.
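
To illustrate how a durationMilliseconds value such as "2 minutes" relates to a number of milliseconds, here is a small Java sketch. It is my own illustration rather than library code: the class DurationSketch is hypothetical, returning -1 for "infinite" is a convention chosen for this sketch, and the sketch requires whitespace between the number and the units.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not the Config4* implementation. */
final class DurationSketch {
    private static final Map<String, Long> MS_PER_UNIT = new HashMap<>();
    static {
        MS_PER_UNIT.put("millisecond", 1L);
        MS_PER_UNIT.put("second", 1_000L);
        MS_PER_UNIT.put("minute", 60_000L);
        MS_PER_UNIT.put("hour", 3_600_000L);
        MS_PER_UNIT.put("day", 86_400_000L);
        MS_PER_UNIT.put("week", 604_800_000L);
    }

    /** Returns the duration in milliseconds, or -1 for "infinite" (this sketch's convention). */
    static long toMilliseconds(String value) {
        String v = value.trim();
        if (v.equals("infinite")) return -1L;
        String[] parts = v.split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected '<float> <units>': " + value);
        }
        double amount = Double.parseDouble(parts[0]);
        // Accept plural forms by dropping a trailing "s" (no singular unit ends in "s").
        String unit = parts[1].endsWith("s")
            ? parts[1].substring(0, parts[1].length() - 1)
            : parts[1];
        Long factor = MS_PER_UNIT.get(unit);
        if (factor == null) {
            throw new IllegalArgumentException("bad units: " + parts[1]);
        }
        return Math.round(amount * factor);
    }

    public static void main(String[] args) {
        System.out.println(toMilliseconds("2 minutes"));   // 120000
        System.out.println(toMilliseconds("1.5 seconds")); // 1500
    }
}
```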

9.2.4.2 List-based Types

There are three list-based schema types: list, tuple and table. I will discuss each in turn.

The list type takes a single argument that denotes the type for every item in the list. For example, the schema below indicates that variable x is a list of strings, variable y is a list in which each item is an integer, and variable z is a list in which each item is of type money:

String[] schema = new String[] {
    "@typedef money = units_with_float[EUR, GBP, USD]",
    "x = list[string]",
    "y = list[int]",
    "z = list[money]",
    ...
};

The tuple type uses a list to emulate a compound data structure, akin to a Pascal record, a C/C++ struct, or a POJO (that is, a Plain Old Java Object). For example, consider the following C++ type:

struct person {
    string    name;
    int       age;
    float     height;
};

In a configuration file, we might wish to represent person data structures as follows:

employee = ["John Smith", "42", "186 cm"];
manager = ["Sam White", "39", "170 cm"];

Notice that both of the above lists contain three items that correspond to the name, age and height fields of the C++ struct. We can validate those lists by using the tuple type, which takes one or more pairs of arguments that denote the type and name of a field within the struct:

String[] schema = new String[] {
  "@typedef size = float_with_units[cm, m, inches, feet]",
  "@typedef person = tuple[string,name, int,age, size,height]",
  "employee = person",
  "manager = person",
};

The arguments to a tuple specify not just the type of each item in the list, but also the name of that item. This enables the schema validator to produce informative error messages. As an example of this, assume that the example.cfg file contains the following:

foo {
  employee = ["John Smith", "42", "hello"];
  manager = ["Sam White", "39", "170 cm"];
}

If we perform a schema validation on the foo scope, then we will receive the following error message:

example.cfg: bad size value (’hello’) for element 3 (’height’) of the
’foo.employee’ person: should be in the format ’<float> <units>’ where
<units> is one of: ’cm’, ’m’, ’inches’, ’feet’
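
The way in which field names make such an error message possible can be sketched as follows. This Java code is an illustration only: the TupleSketch class, its Field helper and the simplified per-type checks are all hypothetical, and the wording of the message only loosely imitates the one shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative sketch only; not the Config4* implementation. */
final class TupleSketch {
    /** One field of a tuple: a type name, a field name and a check for that type. */
    static final class Field {
        final String type;
        final String name;
        final Predicate<String> check;

        Field(String type, String name, Predicate<String> check) {
            this.type = type;
            this.name = name;
            this.check = check;
        }
    }

    /** Fields for tuple[string,name, int,age, size,height], with simplified checks. */
    static List<Field> personFields() {
        return Arrays.asList(
            new Field("string", "name", s -> true),
            new Field("int", "age", s -> s.matches("[+-]?\\d+")),
            new Field("size", "height",
                s -> s.matches("[+-]?\\d+(\\.\\d+)?\\s*(cm|m|inches|feet)")));
    }

    /** Returns null if the items are valid, otherwise a message naming the bad element. */
    static String validate(String tupleName, List<Field> fields, List<String> items) {
        if (items.size() != fields.size()) {
            return "'" + tupleName + "' should contain " + fields.size() + " elements";
        }
        for (int i = 0; i < fields.size(); i++) {
            Field f = fields.get(i);
            if (!f.check.test(items.get(i))) {
                return "bad " + f.type + " value ('" + items.get(i) + "') for element "
                    + (i + 1) + " ('" + f.name + "') of '" + tupleName + "'";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(validate("foo.employee", personFields(),
            Arrays.asList("John Smith", "42", "hello")));
    }
}
```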

To understand the table type, consider the following example:

people = [
  # name           age     height
  #-------------------------------
  "John Smith",   "42",   "186 cm",
  "Sam White",    "39",   "170 cm",
];

Syntactically, the people variable is a list of strings. However, the list is formatted to look like a table that consists of several rows, each of which contains three columns. The comment at the top indicates the name for each column. A suitable schema for this can be defined with the table type, as shown below:

String[] schema = new String[] {
  "@typedef size = float_with_units[cm, m, inches, feet]",
  "people = table[string,name, int,age, size,height]"
};

The arguments of the table type are specified as pairs that indicate the type and name of each column in the table. If the schema validator encounters an error, then the error message indicates the row number of the invalid item, plus the name of the column. For example, if we replace "186 cm" with "hello" in the first row of the example table, then the schema validator will report the following error:

example.cfg: bad size value (’hello’) for the ’height’ column in row 1
of the ’people’ table: should be in the format ’<float> <units>’ where
<units> is one of: ’cm’, ’m’, ’inches’, ’feet’
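
Because a table is syntactically a flat list, the row and column of an item have to be derived from its position in that list. The Java sketch below shows the arithmetic. It is an illustration only; the class TableIndexSketch is hypothetical, and I am assuming that rows are numbered from 1, as in the error message above.

```java
/** Illustrative sketch only; not the Config4* implementation. */
final class TableIndexSketch {
    /** 1-based row number of the item at 0-based position index. */
    static int rowOf(int index, int numColumns) {
        return index / numColumns + 1;
    }

    /** 0-based position of the item's column within the list of column definitions. */
    static int columnOf(int index, int numColumns) {
        return index % numColumns;
    }

    /** A flat list forms whole rows only if its size is a multiple of the column count. */
    static boolean hasWholeRows(int listSize, int numColumns) {
        return listSize % numColumns == 0;
    }

    public static void main(String[] args) {
        String[] columnNames = {"name", "age", "height"};
        int index = 2; // position of "186 cm" in the people list
        System.out.println("row " + rowOf(index, columnNames.length)
            + ", column '" + columnNames[columnOf(index, columnNames.length)] + "'");
    }
}
```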

9.2.5 Using String-based Arguments

The schema grammar shown in Figure 9.1 indicates that an argument used in a rule can be an identifier or a string literal. In the example below, the arguments are identifiers:

String[] schema = new String[] {
    "fonts = list[string]",
    "background_colour = enum[grey, white, yellow]",
    "log.level = int[0, 3]"
};

It is common to think of identifiers as being textual names, so string, grey, white and yellow are clearly identifiers. However, the definition of an identifier given in Section 8.4 indicates that numbers are also classified as identifiers. Thus, the arguments 0 and 3 used in the definition of log.level are identifiers.

The schema grammar permits string literals to be used (instead of identifiers) for arguments. Thus, the above example could be written as follows:

String[] schema = new String[] {
    "fonts = list[\"string\"]",
    "background_colour = enum[\"grey\", \"white\", \"yellow\"]",
    "log.level = int[\"0\", \"3\"]"
};

As you can see, the need to escape the double quotes makes this syntax somewhat cumbersome. For this reason, it is common to write arguments as identifiers rather than as string literals whenever possible. However, sometimes it is necessary to write arguments as strings. Two examples come to mind.

First, use of string literals enables schema validation for strings that contain scientific or currency symbols. Thus, the following schema:

String[] schema = new String[] {
    "money = units_with_float[\"€\", \"£\", \"$\"]"
};

can validate a variable such as:

money = "£19.99";

Second, you need to use string literals if you want to express minimum and maximum values for memory sizes or durations:

String[] schema = new String[] {
    "timeout = durationSeconds[\"10 seconds\", \"5 minutes\"]",
    "RAM_size = memorySizeMB[\"512 MB\", \"4 GB\"]"
};
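
A minimum and maximum such as "512 MB" and "4 GB" use different units, so comparing them requires converting both to a common unit first. The following Java sketch illustrates that idea. It is not library code: the class MemorySizeSketch is hypothetical, and I am assuming that 1 KB is 1024 bytes (and likewise for the larger units) and that the bounds are inclusive.

```java
/** Illustrative sketch only; not the Config4* implementation. */
final class MemorySizeSketch {
    /** Converts "<float> <units>" to bytes. Assumption: 1 KB = 1024 bytes, and so on. */
    static double toBytes(String value) {
        String[] parts = value.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected '<float> <units>': " + value);
        }
        double amount = Double.parseDouble(parts[0]);
        switch (parts[1]) {
            case "byte":
            case "bytes": return amount;
            case "KB": return amount * 1024.0;
            case "MB": return amount * Math.pow(1024.0, 2);
            case "GB": return amount * Math.pow(1024.0, 3);
            case "TB": return amount * Math.pow(1024.0, 4);
            case "PB": return amount * Math.pow(1024.0, 5);
            default: throw new IllegalArgumentException("bad units: " + parts[1]);
        }
    }

    /** True if value lies between min and max (inclusive) once all are converted to bytes. */
    static boolean inRange(String value, String min, String max) {
        double v = toBytes(value);
        return v >= toBytes(min) && v <= toBytes(max);
    }

    public static void main(String[] args) {
        System.out.println(inRange("2 GB", "512 MB", "4 GB"));   // true
        System.out.println(inRange("256 MB", "512 MB", "4 GB")); // false
    }
}
```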

9.2.6 Ignore Rules

There are many “framework” libraries that simplify the development of specific types of software, such as GUI applications or client-server applications. Let’s assume you are developing a framework library called YAF (an acronym for “Yet Another Framework”). YAF provides useful built-in functionality, but it also has a documented plug-in architecture, so extra functionality can be added easily by third-party companies.

A YAF-based application is likely to require configuration information for all of the following: (1) the core functionality of YAF; (2) each plug-in that is loaded by YAF; and (3) application-level code. Because you are the developer of YAF, you can define a schema for the configuration variables required for (1). However, you are unable to predict what the schema should be for (2) or (3). The ignore schema statements, which I will discuss in this section, make it possible for you to write a schema that can validate the configuration information for (1) while ignoring configuration information for (2) and (3). Then, the developer of a plug-in can write another schema to validate the configuration information for that plug-in. Likewise, an application developer can write another schema to validate the configuration information specific to the application code. To illustrate this, consider the example configuration file shown in Figure 9.5, and its schema shown in Figure 9.6.

Figure 9.5: Example configuration for an application built with YAF

foo {
    timeout = "2 minutes";
    fonts = ["Times Roman", "Helvetica", "Courier"];
    log {
        dir = "C:\foo\logs";
        level = "1";
    }
    application { ... }
    plugins {
        load = ["tcp", "shared_memory"];
        tcp {
            host = "localhost";
            port = "5050";
            buffer_size = "8 KB";
        }
        ssl { ... }
        shared_memory { ... }
    }
}

Figure 9.6: Schema for the configuration shown in Figure 9.5

String[] schema = new String[] {
    "timeout = durationMilliseconds",
    "fonts = list[string]",
    "log = scope",
    "log.dir = string",
    "log.level = int[0, 3]",
    "application = scope",
    "@ignoreEverythingIn application",
    "plugins = scope",
    "plugins.load = list[string]",
    "@ignoreScopesIn plugins",
};

The application and plugins scopes store configuration information for application-level code and plugins. The "@ignoreEverythingIn application" rule instructs the schema validator to ignore everything (that is, variables and nested scopes) in the application scope. The "@ignoreScopesIn plugins" rule instructs the schema validator to ignore nested scopes in the plugins scope. By not ignoring variables in that scope, the schema can validate the plugins.load variable.
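
The decision that an ignore rule implies for a given entry can be sketched as a small function. The Java code below reflects my reading of the three rules as described in this section; it is not the library's implementation. The class IgnoreSketch is hypothetical, and the treatment of entries that sit inside a nested scope (ignored by @ignoreScopesIn, not ignored by @ignoreVariablesIn) is an assumption.

```java
/** Illustrative sketch only; not the Config4* implementation. */
final class IgnoreSketch {
    enum Kind { SCOPES, VARIABLES, EVERYTHING }

    /**
     * @param rule          the kind of ignore rule
     * @param ignoredScope  the scope named in the rule, for example "plugins"
     * @param entryName     the scoped name of an entry, for example "plugins.tcp.host"
     * @param entryIsScope  true if the entry is itself a scope
     */
    static boolean shouldIgnore(Kind rule, String ignoredScope,
                                String entryName, boolean entryIsScope) {
        String prefix = ignoredScope + ".";
        if (!entryName.startsWith(prefix)) {
            return false; // the entry is outside the scope named in the rule
        }
        String rest = entryName.substring(prefix.length());
        boolean insideNestedScope = rest.indexOf('.') >= 0;
        switch (rule) {
            case EVERYTHING: return true;
            case SCOPES:     return insideNestedScope || entryIsScope;
            case VARIABLES:  return !insideNestedScope && !entryIsScope;
            default:         return false;
        }
    }

    public static void main(String[] args) {
        // With "@ignoreScopesIn plugins", plugins.load is still validated...
        System.out.println(shouldIgnore(Kind.SCOPES, "plugins", "plugins.load", false));
        // ...but the tcp scope and its contents are skipped.
        System.out.println(shouldIgnore(Kind.SCOPES, "plugins", "plugins.tcp", true));
        System.out.println(shouldIgnore(Kind.SCOPES, "plugins", "plugins.tcp.host", false));
    }
}
```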

There is a third ignore command called @ignoreVariablesIn. That command instructs the schema validator to ignore variables (but not nested scopes) in the specified scope. That command is provided for completeness, but I have not (yet) found a non-contrived use for it.

If you use the config2cpp or config2j utility to generate a schema from a fallback configuration file, then the utility uses built-in heuristics to decide what the schema should be. However, as discussed in Section 6.4, you can use a second configuration file, such as that shown in Figure 6.4, to tweak the generated schema. This second configuration file has an ignore_rules configuration variable that you can use to specify a list of ignore rules that will then be copied into the generated schema.

9.3 Using Code to Define Schema Types

Let’s assume you routinely write applications that obtain, say, email addresses and dates from configuration files. Unfortunately, Config4* does not have built-in schema types for email addresses or dates. Because of this, you may decide to write application code to perform the necessary validation checks. Although this approach will work, you will end up cluttering application code with hand-written validation checks. It would be preferable for emailAddress and date to be built-in types for the schema validator. This raises two interesting questions.

Question one: why doesn’t the schema validator have emailAddress and date types? The answer is a combination of three reasons. First, I have not needed those types in my own projects so far, and I don’t have the time to be adding support for types that I (or my colleagues or customers) may not need. Second, there are many different (and sometimes conflicting) standardised ways to write a date, and I don’t know which ones I should support in a date type. Finally, I realised that I cannot hope to predict all the schema types that somebody, somewhere, will require.

Question two: is it possible to add those types to the schema validator? The answer is yes. The schema validator has an API that enables people to extend it with new schema types. You will find the full details of how to do this in the Config4* C++ API and Config4* Java API manuals.

9.4 Summary

Config4* provides a schema language that you can use to define the entries permitted within the configuration scope for an application. An application can use the SchemaValidator class to validate its configuration scope against the schema. If a schema validation error is encountered, then a ConfigurationException that indicates the error is thrown.

The schema language is concise and simple to learn. It has many built-in types, plus an API that enables developers to add types specific to their needs. The schema language also provides several ignore rules that enable the entries in a nested scope to be ignored during schema validation. A motivating use of this is to enable schema validation for a framework library to ignore configuration entries specific to plug-ins or application-level code.
