Chapter 6 Parsers and Lexical Analysers

6.1 Introduction

In this chapter, I discuss the approach taken to implement the parsers and lexical analysers that are used in Config4*.

6.2 Avoidance of Parser Generators

There are many tools available that can generate a lexical analyser or parser. However, I decided to write those parts of Config4* by hand. I did this for two reasons.

First, when I was an undergraduate student at university, I learned how to use the UNIX tools lex (for generating a lexical analyser) and yacc (for generating a parser). But I also learned how to write a lexical analyser and recursive-descent parser by hand. I prefer the hand-written approach, at least for small grammars.

Second, one of my hopes is that people will volunteer to implement Config4* in other programming languages. It should be straightforward to port a hand-written lexical analyser and parser to another (object-oriented or procedural) programming language. This is because those components are implemented with just “normal code”, so the port from C++ to, say, Ada would be just a port of “normal code”.

In contrast, if I had used, say, lex and yacc in the C++ implementation, and you wanted to implement an Ada version, then you would have had to do the following.

Find lex- and yacc-like tools for Ada.
Translate the lex and yacc files into the syntax required for their counterparts in Ada. Doing this would probably require you to learn how to use lex and yacc and their Ada counterparts.
If there is not a one-to-one match of features in lex and yacc, and their Ada counterparts, then you would have to to figure out how to emulate some of the lex and yacc functionality in their Ada counterparts. If you did this incorrectly, then you might introduce subtle bugs in the lexical analyser or parser.

The above steps would introduce complexity that does not exist with a hand-written lexical analyser and parser.

6.3 Lack of Error Recovery

The parsers in many compilers implement error recovery. This means that when the parser encounters an error, it reports the error and then tries to recover by skipping input until the parser encounters, say, the next semicolon (indicating the end of a statement) or close brace (indicating the end of a scope). Having recovered, the parser can examine the rest of the input file to check if it contains any additional errors. Error recovery enables a single run of a compiler to report several errors. This can speed up software development, especially if the compiler is slow.

To keep the Config4* parser simple, it does not make any attempt to do error recovery. Instead, when the parser encounters an error, it reports the error and stops immediately. The lack of error recovery simplifies the design of the Config4* parser. And because the parser is extremely fast, I do not feel it is particularly burdensome for a user to fix one error at a time and then attempt to re-parse a configuration file.

6.4 A Hierarchy of Lexical Analysers

The Config4* library provides parsers—and, hence, lexical analysers—for two distinct languages: (1) the configuration language; and (2) the schema language. There is a lot of overlap between the lexical analysers used to implement the two parsers. For example, the lexical analysers recognise string literals, identifiers and punctuation symbols (such as "=" and ",") in the same way. However, each lexical analyser must recognise distinct sets of keywords and function names.

To avoid code bloat and simplify maintenance, the lexical analysers are implemented as a class hierarchy that consists of a base class plus two sub-classes. The base class, LexBase, implements almost all the functionality required of the two lexical analysers. However, this base class is not hard-coded with details of keywords or function names. Instead, two of its instance variables are pointers/references to arrays that provide information about keywords and function names. The constructors of the ConfigLex and SchemaLex subclasses simply initialise those arrays with information about the keywords and function names specific to a particular language.

6.5 Parsing @if-then-@else statements

Config4* contains a bug in how @if-then-@else statements are parsed. In this section, I start by explaining a subtle problem that underpins the bug. Afterwards, I explain the buggy approach I took to addressing the problem.

6.5.1 A Subtle Problem

Consider the following configuration file:

@if (osType() == "unix") {
    # then part
    install_dir = getenv("FOO_HOME");
    log_dir = install_dir + "/logs/" + exec("hostname");
} @else {
    # else part omitted for brevity
    ...
}

Let’s assume the above file is parsed on a Windows-based computer, so the condition evaluates to false. Because of this, the parser must ignore all the statements in the “then” part of the @if-then-@else statement, and instead process all the statements in the “else” part.

But what is meant by ignore? Obviously, the parser must not update the Configuration object with name=value pairs for install_dir or log_dir.

Actually, ignore implies more than just “do not update”. It also implies “do not query” the Configuration object. To understand why, consider the second assignment statement in the above configuration file. When parsing that statement, the parser must not try to query the value of the install_dir variable because the “do not update” behaviour means that that variable does not exist.

Ignore also implies that function calls should not be evaluated. There are two reasons for this. The first reason is optimisation—for example, it would be a waste of CPU cycles to execute the hostname command and capture its output if that output will then be silently ignored. The other reason is that the parameter passed to a function might be a variable that (because of the “do not update” behaviour) might have not been assigned a value.

Ideally, when the parser is ignoring the “then” part of the @if-then-@else statement, the only thing it should be doing is making sure there are no syntax errors in the statements being ignored. Unfortunately, the grammar used by the parser makes this difficult to do in an efficient and easy-to-code manner. To understand why, consider the following abridged assignment statements:

foo = "hello" + ... ;
foo = getenv( ... ) + ... ;
foo = [ ... ;
foo = split( ... ) + ... ;
foo = bar + ... ;

The parser processes each of the above statements as follows.

First, the parser obtains the foo identifier token from the lexical analyser. At this point, the parser does not know if the statement is an assignment statement or the opening of a scope. To find out, the parser obtains the next token from the lexical analyser. If that next token is an open brace ("{"), then the parser knows it is parsing the opening of a scope. Conversely, if the token is an assignment operator ("=" or "?="), then the parser knows it is parsing an assignment statement.

Now that the parser knows it is dealing with an assignment statement, it needs to determine if the assignment operator is followed by a string expression or a list expression. To do this, the parser obtains the next token from the lexical analyser. If that next token is a string literal (such as "hello") or the name of a function that returns a string (such as "getenv("), then the parser knows it must parse a string expression. Conversely, if the token is "[" or the name of a function that returns a list (such as "split("), then the parser knows it must parse a list expression.

So far, all that I have explained above is straightforward. However, a problem arises if the token after the assignment statement is an identifier (such as bar). In this case, the parser must invoke type() on the Configuration object to determine the type of the variable specified by the identifier. Unfortunately, when the assignment statement is nested inside the non-executed branch of an @if-then-@else statement, the “do not query” behaviour prevents the parser from being able to invoke type().

6.5.2 An Imperfect Approach to Tackling the Problem

The approach taken by the Config4* parser to ignore the statements in a non-executed part of an @if-then-@else statement is as follows.

The parser goes into a loop, in which it consumes tokens from the lexical analyser until it finds the closing brace ("}") of that part of the @if-then-@else statement. The skipToClosingBrace() operation of the ConfigParser class implements that logic.

That approach offers the benefit of being trivial to implement. Unfortunately, it has a significant flaw: it allows syntax errors to go undetected. An example of this is shown in the configuration file below.

@if (osType() == "unix") {
    greeting == "Hello, UNIX user"; # syntax error
} @else {
    greeting = "Hello, Windows user";
}

That file contains a syntax error: the first assignment statement mistakenly uses "==" (the equality testing operator) instead of "=" (an assignment operator). The Config4* parser will not report that error if the configuration file is used on Windows. It is only when a person later uses the configuration file on a UNIX machine that the syntax error will be reported.

Could Config4* be modified to use a more intelligent way of ignoring the non-executed part of an @if-then-@else statement? Probably, but I suspect it would involve significant changes to the parser, and those changes would introduce a lot of complexity.