Notes on the codecvt implementation.

prepared by Benjamin Kosnik (bkoz@redhat.com) on August 25, 2000

1. Abstract

Around page 425 of the C++ Standard, this charming heading comes into view:

22.2.1.5 - Template class codecvt [lib.locale.codecvt]

The standard class codecvt attempts to address conversions between different character encoding schemes. In particular, the standard attempts to detail conversions between the implementation-defined wide characters (hereafter referred to as wchar_t) and the standard type char that is so beloved in classic "C" (which can now be referred to as narrow characters.)

This document attempts to describe how the GNU libstdc++-v3 implementation deals with the conversion between wide and narrow characters, and also presents a framework for dealing with the huge number of other encodings that iconv can convert, including Unicode and UTF8. Design issues and requirements are addressed, and examples of correct usage for both the required specializations for wide and narrow characters and the implementation-provided extended functionality are given.

2. What the standard says

2. Some thoughts on what would be useful

Probably the most frequently asked question about code conversion is: "So dudes, what's the deal with Unicode strings?" The dude part is optional, but apparently the usefulness of Unicode strings is pretty widely appreciated. Sadly, this specific encoding (and other useful encodings like UTF8, UCS4, ISO 8859-10, etc.) is not mentioned in the C++ standard.

In particular, the simple implementation detail of wchar_t's size seems to repeatedly confound people. Many systems use a two-byte, unsigned integral type to represent wide characters, and use an internal encoding of Unicode or UCS2. (See AIX, Microsoft NT, Java, others.) Other systems use a four-byte, unsigned integral type to represent wide characters, and use an internal encoding of UCS4. (GNU/Linux systems using glibc, in particular.) The C programming language (and thus C++) does not specify a size for the type wchar_t.

Thus, portable C++ code cannot assume a byte size (or endianness) either.

Getting back to the frequently asked question: What about Unicode strings?

The text around the codecvt definition gives some clues:

-1- The class codecvt<internT,externT,stateT> is for use when converting from one codeset to another, such as from wide characters to multibyte characters, between wide character encodings such as Unicode and EUC.

Hmm. So, in some unspecified way, Unicode encodings and translations between other character sets should be handled by this class.

-2- The stateT argument selects the pair of codesets being mapped between.

Ah ha! Another clue...

-3- The instantiations required in the Table ?? (lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and codecvt<char,char,mbstate_t>, convert the implementation-defined native character set. codecvt<char,char,mbstate_t> implements a degenerate conversion; it does not convert at all. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for tiny and wide characters. Instantiations on mbstate_t perform conversion between encodings known to the library implementor. Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_convert member.

At this point, the initial design of the library becomes clear:

3. How to accomplish this: partial specialization with an iconv wrapper class, __enc_traits.

4. Design

a. goals.

b. drawbacks

c. things that are sketchy

5. Examples

a. conversions involving string literals

b. conversions involving std::string

c. conversions involving std::filebuf and std::ostream

6. Acknowledgments

Ulrich Drepper for the iconv suggestions and patient question answering, Jason Merrill for the template partial specialization hints and wchar_t fixes, etc etc etc.

7. Bibliography / Referenced Documents

ISO/IEC 14882:1998 Programming languages - C++

ISO/IEC 9899:1999 Programming languages - C

glibc-2.2 docs

System Interface Definitions, Issue 6 (IEEE Std. 1003.1-200x)

The Open Group/The Institute of Electrical and Electronics Engineers, Inc.

http://www.opennc.org/austin/docreg.html

Appendix D, The C++ Programming Language, Special Edition, Bjarne Stroustrup, Addison Wesley, Inc. 2000

Standard C++ IOStreams and Locales, Advanced Programmer's Guide and Reference, Angelika Langer and Klaus Kreft, Addison Wesley Longman, Inc. 2000

Numerous, late-night email correspondence with Ulrich Drepper (drepper@redhat.com).

2. What the standard says

One: The standard clearly implies that attempts to add non-required (yet useful and widely used) conversions need to do so through the third template parameter, stateT.

Two: The required conversions, by specifying mbstate_t as the third template parameter, imply an implementation strategy that is mostly (or wholly) based on the underlying C library, and the functions mbsrtowcs and wcsrtombs in particular.

2. Some thoughts on what would be useful

The thought that all one needs to convert between two arbitrary codesets is two types and some kind of state argument is unfortunate. In particular, encodings may be stateless. The naming of the third parameter as stateT is unfortunate, as what is really needed is some kind of generalized type that accounts for the issues that abstract encodings will need. The minimum information that is required includes:

1. Abstract

2. What the standard says

2. Some thoughts on what would be useful

3. Problems with "C" code conversions: thread safety, global locales, termination.

4. Design

5. Examples

6. Unresolved Issues

7. Acknowledgments

8. Bibliography / Referenced Documents