Chapter 10 Internationalisation
10.1 Introduction
One aspect of internationalisation is the ability of an application to accept input in multiple human languages, such as English, Greek and Japanese. Nowadays, that ability is usually achieved by providing support for Unicode in the application. Having read some books on Unicode, I have come to two conclusions. First, the Unicode standard has some rough edges that can be irritating. Second, and more frustratingly, Unicode is not implemented widely enough in programming languages. Both of these issues affect Config4*, as I explain in this chapter. However, I expect that some readers may have a poor understanding of Unicode, so I will start by giving an overview of its concepts and terminology.
10.2 Unicode Concepts and Terminology
Unicode 1.0 was defined as a 16-bit character set. This meant it could represent a maximum of 2¹⁶ = 65,536 characters. In Unicode terminology, a code point is a number that denotes a character within a character set. For example, in the ASCII character set, code point 65 denotes the character ‘A’. Thus, Unicode 1.0 had 2¹⁶ = 65,536 code points.
The designers of Unicode 1.0 estimated that supporting all the living languages in the world would require about 16,000 code points, so the 16-bit limit of Unicode 1.0 seemed to be more than sufficient. However, within a few years, they realised that their estimate was too low. That, combined with an emerging desire for Unicode to be able to support ancient languages such as Egyptian Hieroglyphs, meant Unicode had to extend beyond 16 bits.
To accommodate the additional code points (and allow room for future ones), Unicode 2.0 was defined to be a 21-bit character set. Of course, since a 21-bit word size is uncommon in computers, Unicode is normally thought of as being a 32-bit character set (the high-order 11 bits are unused). You might think this means that Unicode can now support a maximum of 2²¹ = 2,097,152 code points. However, some technical details in the Unicode specification mean that parts of the number range are unusable, so Unicode is able to support (only) 1,114,112 code points. Currently (as of Unicode 5.2), 107,361 of these code points have been assigned, so there is still significant room for future expansion.
10.2.1 Planes and Surrogate Pairs
If you want to store a collection of all the 1,114,112 code points in Unicode 2.0, then you could use a single-dimensional array of that size. However, another possibility is to use a two-dimensional array, because 17 × 65,536 = 1,114,112. When the Unicode Consortium was extending Unicode beyond 16 bits, it decided to use such a two-dimensional representation. In Unicode terminology, the range of code points is spread across 17 planes, where each plane consists of 2¹⁶ = 65,536 code points.
The 17 planes are numbered 0..16. Plane 0 contains the 2¹⁶ code points from Unicode 1.0. To enable Unicode 2.0 to expand beyond the 16 bits of Unicode 1.0, 2,048 code points within plane 0 (U+D800..U+DFFF) were reserved for use as escape codes. This makes it possible to represent a code point outside plane 0 with two 16-bit words: the code point is reduced to a 20-bit offset beyond plane 0, and the first word encodes the upper 10 bits of that offset, while the second word encodes the lower 10 bits. In Unicode terminology, such a two-word pair is called a surrogate pair; the first word (in the range U+D800..U+DBFF) is called the high surrogate, and the following word (in the range U+DC00..U+DFFF) is called the low surrogate.
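The arithmetic can be sketched in Java (the class and method names below are invented for illustration): a code point beyond plane 0 is reduced to a 20-bit offset, whose upper and lower 10 bits select the high and low surrogates.

```java
public class SurrogateDemo {
    // Returns {highSurrogate, lowSurrogate} for a code point outside plane 0.
    static char[] toSurrogatePair(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit offset into planes 1..16
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x1D11E;  // MUSICAL SYMBOL G CLEF, which lies in plane 1
        char[] pair = toSurrogatePair(codePoint);
        System.out.printf("plane=%d high=%04X low=%04X%n",
                          codePoint >> 16, (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion (available from Java 5.0).
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == codePoint);
    }
}
```

Running this prints `plane=1 high=D834 low=DD1E` and `true`: U+1D11E is stored in UTF-16 as the pair D834 DD1E.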
10.2.2 UCS-2, UTF-8, UTF-16 and UTF-32
Encoding each plane-0 code point as a single 16-bit word, and each higher code point as a surrogate pair, is one way to represent 21-bit Unicode code points; that encoding format is known as UTF-16.
UCS-2 refers to the “16-bits fixed size” encoding used in Unicode 1.0. Many people mistakenly think that UCS-2 and UTF-16 are the same. The difference between them is subtle: UTF-16 supports surrogate pairs (thus making it possible to support 21-bit code points), while UCS-2 does not support surrogate pairs.
Another Unicode encoding format is UTF-32, which, as its name suggests, encodes a 21-bit code point as a 32-bit integer (the highest 11 bits are unused). Obviously, UTF-32 is a trivial encoding format.
Yet another Unicode encoding format is UTF-8. This uses one byte to encode code points from 0..127, and uses multi-byte escape sequences to encode higher code points. The details of this encoding format are outside the scope of this discussion. The main point to note is that UTF-8 uses between 1 and 4 bytes to encode a code point; the higher the code point, the more bytes are required to encode it.
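The 1-to-4-byte behaviour is easy to observe in Java, whose standard library includes a UTF-8 encoder (the helper function below is illustrative, not part of any real API):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes UTF-8 uses to encode a single code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 'A' (ASCII)    -> 1
        System.out.println(utf8Length(0xE9));    // 'é' (Latin-1)  -> 2
        System.out.println(utf8Length(0x20AC));  // '€' (plane 0)  -> 3
        System.out.println(utf8Length(0x1D11E)); // '𝄞' (plane 1)  -> 4
    }
}
```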
10.2.3 Merits of Different Encodings
There is no “obviously best” encoding for Unicode. Instead, each encoding has benefits and drawbacks.
- UTF-32. The main benefit of this encoding is the convenience for programmers of knowing that a code point is always represented in a fixed-size amount of memory (a 4-byte integer). Because of this, programmers do not have to worry about correctly handling surrogate pairs (in UTF-16) or multi-byte escape sequences (in UTF-8).
The main drawback is the amount of RAM or disk space required to store UTF-32 strings. Some people hold the viewpoint that RAM and disk space are getting exponentially cheaper, so concerns about space inefficiency will gradually reduce over time. I partially agree with that sentiment. However, although the capacity of disk drives increases rapidly from year to year, the bandwidth available for transferring files to/from a disk (or across a network) rises more slowly. Because of this, it is beneficial to use a compact encoding when storing Unicode text in files or transferring it across a network.
- UTF-16. All the code points required to support the majority of the world’s living languages are contained in plane 0. Notable exceptions are Chinese, Japanese and Korean (often abbreviated to CJK): plane 0 does not encode all the ideographs used in CJK, but it does encode most of the commonly used ones. Because of this, surrogate pairs tend to be used infrequently in most UTF-16 strings. Thus, one significant benefit of UTF-16 is that strings encoded in it usually require about half as much space as strings encoded in UTF-32.
Another benefit of UTF-16 is that writing code to deal with the possibility of surrogate pairs is easier than writing code to deal with the possibility of multi-byte escape sequences (in UTF-8). Because of the above two benefits, UTF-16 is commonly perceived as providing a better “size versus complexity” trade-off than either UTF-8 or UTF-32.
- UTF-8. A string encoded in UTF-8 is guaranteed to be no longer than (and is typically much shorter than) the same string encoded in UTF-32. The same is not true when comparing UTF-8 to UTF-16. Whether a string encoded in UTF-8 consumes less memory or more memory than the same string encoded in UTF-16 depends on the language used in the string. For example, UTF-8 usually requires just one or two bytes to encode a character used in a Western language, but two or three bytes to encode a character used in an Eastern language. Despite this uncertainty about the space efficiency of UTF-8 versus UTF-16, UTF-8 is commonly perceived as being the most space-efficient encoding of Unicode.
Another benefit of UTF-8 is that it works well with byte-oriented networking protocols.
A drawback of UTF-8 is the complexity involved in writing code that correctly deals with multi-byte escape sequences.
- UCS-2. It is best to avoid UCS-2 when writing new applications. Some programmers working with UTF-16 write code that does not handle surrogate pairs. In effect, this means that their applications can handle only UCS-2 rather than UTF-16. Sometimes, such programmers will claim that this limitation is acceptable because (they mistakenly believe that) plane 0 encodes all the characters of all the world’s living languages, and they do not feel it is important for their applications to support ancient languages, such as Egyptian Hieroglyphs. However, that assumption about plane 0 is incorrect: some living languages contain characters that are encoded outside plane 0.
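The space trade-offs discussed above are easy to observe. The following sketch (in Java, chosen because its standard library can encode a string into several charsets) compares the sizes of a Western and a CJK string in each encoding; the UTF-32 size is computed as four bytes per code point rather than via a charset. The class name is invented for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Returns {UTF-8 bytes, UTF-16 bytes, UTF-32 bytes} for a string.
    static int[] sizes(String s) {
        return new int[] {
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length, // BE variant: no BOM
            4 * s.codePointCount(0, s.length())           // 4 bytes per code point
        };
    }

    public static void main(String[] args) {
        for (int n : sizes("hello")) System.out.print(n + " ");  // 5 10 20
        System.out.println();
        for (int n : sizes("日本語")) System.out.print(n + " "); // 9 6 12
        System.out.println();
    }
}
```

As the comments show, UTF-8 wins for the Western string while UTF-16 wins for the CJK string, and UTF-32 is the largest in both cases.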
10.2.4 Transcoding
If a programming language supports Unicode, then it is likely that the language provides native support for one of UTF-8, UTF-16 or UTF-32.[1] It is common for the programming language to provide utility functions for converting between its native Unicode encoding and other character set encodings.
The term transcoding is commonly used to refer to the conversion of a string from one character-set encoding to another. In some programming languages with Unicode support, transcoding takes place automatically during file input/output.
For example, when an application reads a text file, the file contents are transcoded from the character set specified by the computer’s locale into the programming language’s internal Unicode format. The application then processes the text in the Unicode encoding. Finally, when the application writes the text back out to file, the text is automatically transcoded from the programming language’s Unicode format back into the character-set encoding specified by the computer’s locale.
Programming languages that automatically transcode during file input/output try to achieve the best of both worlds: they provide a programmer-friendly encoding (typically UTF-16 or UTF-32) to manipulate strings in RAM, and a space-efficient encoding (perhaps UTF-8) when transferring to/from disk or across a network.
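In Java, for example, this transcoding happens inside the InputStreamReader and OutputStreamWriter classes. The following sketch (using in-memory streams instead of files, so it is self-contained) shows UTF-16 chars in RAM being transcoded to UTF-8 bytes and back; the class and method names are invented for illustration.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Transcode Java's in-RAM UTF-16 chars to UTF-8 bytes for "disk".
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);  // the Writer performs the transcoding
        }
        return sink.toByteArray();
    }

    // Transcode UTF-8 bytes back into a UTF-16 String.
    static String decode(byte[] bytes) throws IOException {
        Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int c; (c = r.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "Grüße";             // 5 chars in RAM (UTF-16)
        byte[] onDisk = encode(text);
        System.out.println(onDisk.length); // 7 bytes: 'ü' and 'ß' take 2 each
        System.out.println(decode(onDisk).equals(text)); // true
    }
}
```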
10.3 Unicode Support in Java
Java has always supported Unicode through its 16-bit char type. The first version of Java was released in January 1996, which was during the final months of Unicode 1.x. Because of this, Java supported UCS-2 initially.
Version 2.0 of Unicode, which defined UTF-16 and UTF-32, was released in July 1996. However, Java continued to support just UCS-2 for another eight years. Java 5.0, released in September 2004, finally upgraded Java’s Unicode support from UCS-2 to UTF-16. To provide support for UTF-16, the Character and String classes were extended with new operations to identify surrogate pairs, to convert a surrogate pair into a 32-bit code point, and to manipulate code points. For example, the Character.isLetter() operation is now overloaded to take either a 16-bit value (a Java char) or a 32-bit code point.
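A small sketch of those Java 5.0 operations, using U+10400 (DESERET CAPITAL LETTER LONG I), a letter that lies outside plane 0:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+10400 lies in plane 1, so a Java String stores it
        // as the surrogate pair D801 DC00.
        String s = "\uD801\uDC00";

        System.out.println(s.length());                      // 2 (16-bit units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)

        char high = s.charAt(0);
        char low  = s.charAt(1);
        System.out.println(Character.isSurrogatePair(high, low)); // true

        int cp = s.codePointAt(0);                     // combines the pair
        System.out.println(Integer.toHexString(cp));   // 10400

        // isLetter() is overloaded for char and for int code points.
        System.out.println(Character.isLetter(high));  // false: a lone surrogate
        System.out.println(Character.isLetter(cp));    // true
    }
}
```

Note the final two lines: code that examines a string one char at a time sees only surrogates (not letters), which is exactly why surrogate-unaware code degrades to UCS-2.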
The main place in Config4J’s source code where Unicode support arises is the lexical analyser. In particular, the lexical analyser calls Character.isLetter() to help it determine if a character is part of an identifier. There are two obvious ways to handle this in the lexical analyser.
- Approach 1. The lexical analyser could ignore the possibility of surrogate pairs. Doing this would mean that Config4J could be compiled with relatively new compilers (Java 5.0 and later) and also with older compilers. However, by failing to correctly handle surrogate pairs, Config4J would be restricted to working with UCS-2 rather than UTF-16.
- Approach 2. The lexical analyser could make direct use of operations (introduced in Java 5.0) that support surrogate pairs and 32-bit code points. Doing this would enable Config4J to support UTF-16, but would make it impossible for people to compile Config4J with older (pre-Java 5.0) compilers.
There is another, but non-obvious, way for the lexical analyser to handle Unicode issues.
- Approach 3. The lexical analyser could use reflection to determine if the surrogate pair- and code point-related operations of Java 5.0 are available. If those operations are available, then the lexical analyser would use reflection to invoke them, and thus Config4J would support UTF-16. Conversely, if those operations are not available, then the lexical analyser would not attempt to invoke them, and hence Config4J would gracefully degrade to supporting UCS-2. This approach would offer the best of the two previous approaches: Config4J could be compiled with both old and new compilers, and it would support UTF-16 if the Java runtime environment does. There are two minor drawbacks to this approach.
First, invoking operations via reflection is more complex than invoking them directly. Thus, the lexical analyser would be harder to write and maintain. However, the use of reflection would be very localised, so the complexity introduced would be minimal.
Second, invoking operations via reflection is slower than invoking them directly. However, most of the Java 5.0-specific operations required are trivial enough to be reverse engineered and implemented inside Config4J, so they could be invoked without the need to use reflection. Doing that would ensure that the performance overhead of using reflection would be incurred only when a surrogate pair was encountered, which is likely to be very infrequently.
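A sketch of how approach 3 might look (this is not Config4J's actual code; the class and method names are invented, and the sketch uses Java 5.0+ conveniences such as autoboxing, so a version compilable by a pre-5.0 javac would need the older Object[]-based reflection call style):

```java
import java.lang.reflect.Method;

public class LetterChecker {
    private static final Method IS_LETTER_CP = lookupIsLetterCodePoint();

    // Probe (once, at class-load time) for the Java 5.0 overload
    // Character.isLetter(int). Returns null on pre-5.0 runtimes.
    private static Method lookupIsLetterCodePoint() {
        try {
            return Character.class.getMethod("isLetter", int.class);
        } catch (NoSuchMethodException ex) {
            return null;  // old runtime: degrade gracefully to UCS-2
        }
    }

    public static boolean isLetter(int codePoint) {
        if (IS_LETTER_CP != null) {
            try {
                return (Boolean) IS_LETTER_CP.invoke(null, codePoint);
            } catch (Exception ex) {
                // fall through to the UCS-2 path
            }
        }
        // UCS-2 fallback: meaningful only for plane-0 code points.
        return codePoint <= 0xFFFF && Character.isLetter((char) codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isLetter('A'));     // true
        System.out.println(isLetter(0x10400)); // true on Java 5.0 and later
    }
}
```

Caching the Method object at class-load time keeps the reflective lookup out of the per-character path; only the invoke() call itself carries reflection overhead.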
Currently, Config4J uses approach 1. I would like to see Config4J enhanced to support approach 3. I have not implemented approach 3 yet due to a combination of reasons. First, I wanted to release a “good enough” initial version of Config4J and defer improvements for a later release (rather than defer an initial release until Config4J was perfect). Second, I prefer to not write code for working with surrogate pairs unless I can test that code properly and, unfortunately, at the moment I do not have a good way to create, say, CJK-based configuration files that can be used for testing.
10.4 Unicode Support in C and C++
Ideally, I would like Config4Cpp to have the following properties.
- Not be limited to working with 8-bit characters, such as those in, say, English, but rather support the use of characters in arbitrary languages. Since much of the world is converging on Unicode for such support, Config4Cpp should support Unicode.
- Be portable across different C++ compilers and different operating systems.
- Rely on only the standard C library.
In this section, I explain the challenges that make achieving all the above very difficult, if not impossible.
10.4.1 Limitations in the Standard C Library
UTF-8 is an example of a multi-byte character encoding: it uses one or more bytes to encode each character.
UCS-2 and UTF-32 are examples of wide character encodings: they use fixed-size integers (16 or 32 bits) to represent each character.
UTF-16 is a bit unusual. Its support for surrogate pairs means that it is not a wide (that is, fixed-size) character encoding. Likewise, it is not a multi-byte encoding since its basic unit is a 16-bit word rather than an 8-bit byte.
The C and C++ programming languages define the char type, which is usually associated with single-byte character encodings, such as ASCII or the ISO-Latin-N family of encodings. However, the char type can also be used with multi-byte encodings, such as UTF-8.
C and C++ also define the wchar_t type, which is for use with wide character encodings.[2] The C and C++ language specifications do not define the size of wchar_t; the specifications merely state that wchar_t is wide enough to hold all code points in the wide character encodings supported by the compiler and its runtime libraries.
- The width of wchar_t might be as little as 8 bits. That statement might seem like an oxymoron, but it makes sense if you consider a compiler that is developed for use with an embedded system. Such systems typically have a limited amount of RAM and may not have a requirement to support internationalisation. In such a scenario, the wchar_t type and its supporting functions might be implemented as placebo wrappers around the char type and its supporting functions.
- If the width of wchar_t is 16 bits, then this will be sufficient for UCS-2 and some non-Unicode encodings.
- If the width of wchar_t is 32 bits, then this will be sufficient for UTF-32 and some non-Unicode encodings.
An important point is that UTF-16 cannot be supported by a 16-bit wide wchar_t; at least, not without resorting to third-party (and probably proprietary) functions for dealing with surrogate pairs. This is important because Microsoft Windows (mis)uses a 16-bit wide wchar_t type with UTF-16. Thus, if you are writing Unicode-aware applications on Windows, then you are forced to use functions outside of the standard C library to deal with surrogate pairs. This can make it difficult to write Unicode-aware applications that are portable between Windows and UNIX-based operating systems, most of which use a 32-bit wide wchar_t type. One way to maintain cross-platform portability is to not attempt to deal with surrogate pairs; this will result in your application supporting UTF-32 on UNIX but only UCS-2 on Windows.
Another portability problem is that a 32-bit wide wchar_t type and its supporting functions might use UTF-32, or they might use a non-Unicode encoding. For example, whether or not UTF-32 is used by wchar_t on Solaris depends on a user-specified locale setting, and incorrectly assuming that UTF-32 is always used can result in application bugs.[3]
10.4.2 Use of Third-party Unicode Libraries
Having an operating system or programming language provide built-in support for Unicode is desirable, but it is not strictly necessary. This is because there are third-party libraries (both open- and closed-source) that provide Unicode support. However, these libraries tend to be many MB in size. For example, the C/C++ version of the ICU[4] library occupies about 22MB when built for Linux. Libraries that implement Unicode are large because they provide a lot of functionality that is driven by large tables. For example:
- The library must provide a set of properties for each Unicode code point. The properties include the official human-readable name for the code point and its category (an upper-case letter, a lower-case letter, a digit, a punctuation character, and so on). If it is an upper-case letter, then the code point of the corresponding lower-case letter is stored (and vice versa); that information is required to implement utility functions such as toUpper() and toLower().
- The library might also provide tables to support transcoding (see Section 10.2.4) between Unicode and other character set encodings. For example, ICU provides over 300 transcoding tables.
Config4Cpp could be modified so that it uses ICU (or some other Unicode library). Doing that would provide Config4Cpp with portable Unicode support. However, the Config4Cpp library currently occupies a few hundred KB; modifying it to use ICU would add an extra 22MB to its memory footprint. That increase in required memory may be acceptable on many desktop and server machines, but it would make Config4Cpp too heavyweight for use in embedded systems.
10.4.3 UTF-8, UTF-16 or UTF-32?
As a thought experiment, let’s assume we decide to modify Config4Cpp to support UTF-32. Obviously, the public API of Config4Cpp will have to change. For example, the signatures of the insert<type>() and lookup<type>() operations will change to accept UTF-32 strings rather than 8-bit strings (and those UTF-32 strings would then be stored in the internal hash tables).
The UTF-32 version of Config4Cpp will be convenient for programmers who are developing applications that use UTF-32. However, it will not be convenient for programmers who work with UTF-16 or UTF-8. Such programmers will have to frequently transcode (that is, convert) between UTF-32 strings (used by Config4Cpp) and UTF-8 or UTF-16 strings (used by other parts of their applications). The details of how to transcode between UTF-8, UTF-16 and UTF-32 can be found easily with an Internet search, so implementing such transcoding functions will not be difficult. However, there are two problems. First, littering application code with calls to those transcoding functions will decrease code readability and maintainability. Second, repeated transcoding will impose a performance overhead.
Those two problems (decreased readability and a performance overhead) are not unique to using UTF-32. Those same problems will arise whenever different parts of an application use different character encodings. So, it doesn’t really matter if Config4Cpp uses UTF-8, UTF-16, UTF-32 or a locale-specified encoding: the encoding used by Config4Cpp will be convenient for some application developers, and inconvenient for others.
The lack of an “obviously right” Unicode encoding choice affects C/C++ applications because those language specifications do not define built-in support for Unicode. This issue does not arise in, say, Java, because the designers of Java chose to standardise on a specific Unicode encoding.
10.4.4 Approach Currently Used in Config4Cpp
I do not have sufficient Unicode experience to be able to make an informed decision about which character encoding would provide the “lesser of all evils” for use in Config4Cpp. Because of this, I decided to use only the features available in the standard C library. This works as follows.
- All the operations in Config4Cpp’s public API work with C-style strings, that is, char*. Config4Cpp assumes those strings are encoded according to the locale in effect, which might be a single-byte encoding (for example, ASCII or ISO-Latin-N) or a multi-byte encoding (for example, UTF-8).
- The standard C library provides functions, such as isalpha() and isdigit(), to determine the category of a char. Those functions work reliably for a single-byte encoding. However, to deal with the possibility of a multi-byte encoding, it is necessary to use mbstowcs() to transcode the stream of char into a stream of wchar_t and then use iswalpha() and iswdigit() to check the category of a character. That approach is used by Config4Cpp’s lexical analyser so that it can correctly identify the characters permitted in identifiers.
- Once the lexical analyser has identified a (potentially multi-byte) character’s category, it discards the wchar_t. The name=value entries in Config4Cpp’s hash tables are stored as C-style strings, that is, as null-terminated arrays of char. The encoding used in those strings is the encoding specified by the locale in effect.
The approach described above is convenient for application developers who work with C-style strings. It is also convenient for developers who know that the locale uses the UTF-8 encoding. Developers who prefer, say, UTF-16 or UTF-32 will have to transcode between the locale’s encoding and their preferred encoding.
I do not claim that this approach is ideal. Rather, I view it as a temporary measure until a better approach can be determined. In particular, I hope that the open-sourcing of Config4* will result in internationalisation experts within the open source community offering advice on how to improve this aspect of Config4Cpp.
[1] An exception is the D programming language, which provides distinct data-types for each of UTF-8, UTF-16 and UTF-32 (http://en.wikipedia.org/wiki/D_%28programming_language%29#String_handling).
[2] In C, wchar_t is a typedef name (defined in <stddef.h>) for an integral type, while in C++ wchar_t is a keyword.
[3]
[4] ICU (International Components for Unicode) is an open-source library, available in C/C++ and Java flavours. It is hosted at http://site.icu-project.org/.