Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-15 10:08:22

On Fri, 14 Jan 2011 10:59:09 -0500
Dave Abrahams <dave_at_[hidden]> wrote:

> At Fri, 14 Jan 2011 17:50:02 +0200,
> Peter Dimov wrote:
>> Unfortunately not. A library that requires its input paths to be
>> UTF-8 always gets bug reports from users who are accustomed to using
>> another encoding for their narrow strings. There is plenty of
>> precedent they can use to justify their complaint.
> I don't see the problem you cited as an answer to my question. Let me
> try asking it differently: how do I program in an environment that has
> both "right" and "wrong" libraries?
> Also, is there any use in trying to get the difference into the type
> system, e.g. by using some kind of wrapper over std::string that gives
> it a distinct "utf-8" type?

The system I'm now using for my programs might interest you.

I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning
one type to another automatically converts it to the target type during
the copy. (Converting to ascii_t will throw an exception if a resulting
character won't fit into eight bits.)

Each type has an internal storage type as well, based on the character
size (ascii_t and utf8_t use std::string, utf16_t uses 16-bit
characters, etc). You can access the internal storage type using
operator* or operator->. For a utf8_t variable 'v', for example, *v
gives you the UTF-8-encoded string.
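A minimal sketch of the idea (my own illustration, not the actual library
code; the class internals, the decoding loop, and the exception type are
all guesses at one way the described behavior could be implemented):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

class ascii_t;

class utf8_t {
public:
    utf8_t() = default;
    utf8_t(const ascii_t& a);            // ASCII bytes are already valid UTF-8
    std::string&       operator*()       { return data_; }
    const std::string& operator*() const { return data_; }
    std::string*       operator->()      { return &data_; }
private:
    std::string data_;                   // internal storage: UTF-8 bytes
};

class ascii_t {
public:
    ascii_t() = default;
    ascii_t(std::string s) : data_(std::move(s)) {}
    // Converting from UTF-8 decodes each code point and throws if one
    // won't fit into eight bits. Assumes well-formed UTF-8 input.
    ascii_t(const utf8_t& u) {
        const std::string& s = *u;
        for (std::size_t i = 0; i < s.size(); ++i) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            unsigned cp = b;
            int extra = 0;
            if      ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
            else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
            else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
            while (extra-- > 0)
                cp = (cp << 6) | (static_cast<unsigned char>(s[++i]) & 0x3F);
            if (cp > 0xFF)
                throw std::range_error("character won't fit into eight bits");
            data_ += static_cast<char>(cp);
        }
    }
    std::string&       operator*()       { return data_; }
    const std::string& operator*() const { return data_; }
private:
    std::string data_;                   // internal storage: 8-bit characters
};

inline utf8_t::utf8_t(const ascii_t& a) : data_(*a) {}
```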

An std::string is assumed to be ASCII-encoded. If you really do have
UTF-8-encoded data to get into the system, you either assign it to a
utf8_t using operator*, or use a static function utf8_t::precoded.
A std::wstring is assumed to be UTF-16- or UTF-32-encoded already,
depending on the OS's underlying wide-character width.
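The two entry points for pre-encoded data might look like this (a sketch
of my own; precoded is the function named above, but its exact signature,
and everything else here, is a guess):

```cpp
#include <string>
#include <utility>

// Minimal stand-in for the utf8_t described above; only the two entry
// points for pre-encoded data are sketched.
class utf8_t {
public:
    // The static function mentioned above; exact signature is a guess.
    static utf8_t precoded(std::string s) {
        utf8_t u;
        u.data_ = std::move(s);
        return u;
    }
    std::string&       operator*()       { return data_; }
    const std::string& operator*() const { return data_; }
private:
    std::string data_;
};

// Usage: bytes already known to be UTF-8 bypass the ASCII assumption.
// utf8_t a;  *a = raw_utf8_bytes;                 // via operator*
// utf8_t b = utf8_t::precoded(raw_utf8_bytes);    // via the static function
```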

A function is simply declared with parameters of the type that it
needs. You can call it with whichever type you've got, and it will be
auto-converted to the needed type during the call, so for the most part
you can ignore the different types and use whichever one makes the most
sense for your application. I use utf8_t as the main internal string
type for my programs.
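The auto-conversion at the call site presumably falls out of implicit
converting constructors; here is how that mechanism works in miniature
(stand-in types of my own, not the real classes, which would convert
encodings rather than just copy bytes):

```cpp
#include <cstddef>
#include <string>
#include <utility>

struct ascii_t {
    std::string data;
    explicit ascii_t(std::string s) : data(std::move(s)) {}
};

struct utf8_t {
    std::string data;
    utf8_t(const ascii_t& a) : data(a.data) {}  // implicit: ASCII is valid UTF-8
};

// A function is simply declared with the parameter type it needs...
std::size_t byte_length(const utf8_t& s) { return s.data.size(); }

// ...and can be called with whichever type you've got; an ascii_t
// argument is converted to utf8_t automatically during the call:
// ascii_t name("boost");
// std::size_t n = byte_length(name);
```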

For portable OS-interface functions, there's a typedef (os::native_t)
to the type that the OS's API functions need. For Linux-based systems,
it's utf8_t; for Windows, utf16_t. There's also a typedef
(os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but
I'm not sure there's a need for that.
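The typedef selection might be as simple as a preprocessor check (the
namespace name comes from the post; the _WIN32 test and the stub types
are my own guesses):

```cpp
#include <type_traits>

// Stand-ins for the string classes; only the typedef machinery is shown.
struct utf8_t {};
struct utf16_t {};
struct utf32_t {};

namespace os {
#ifdef _WIN32
using native_t  = utf16_t;   // Windows APIs take UTF-16
using unicode_t = utf16_t;
#else
using native_t  = utf8_t;    // Linux-style APIs take UTF-8 byte strings
using unicode_t = utf32_t;
#endif
}
```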

There are some parts of the code that could use polishing, but I like
the overall design, and I'm finding it pretty easy to work with. Anyone
interested in seeing the code?

Chad Nelson
Oak Circle Software, Inc.
