Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-15 10:08:22


On Fri, 14 Jan 2011 10:59:09 -0500
Dave Abrahams <dave_at_[hidden]> wrote:

> At Fri, 14 Jan 2011 17:50:02 +0200,
> Peter Dimov wrote:
>>
>> Unfortunately not. A library that requires its input paths to be
>> UTF-8 always gets bug reports from users who are accustomed to using
>> another encoding for their narrow strings. There is plenty of
>> precedent they can use to justify their complaint.
>
> I don't see the problem you cited as an answer to my question. Let me
> try asking it differently: how do I program in an environment that has
> both "right" and "wrong" libraries?
>
> Also, is there any use in trying to get the difference into the type
> system, e.g. by using some kind of wrapper over std::string that gives
> it a distinct "utf-8" type?

The system I'm now using for my programs might interest you.

I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning
one type to another automatically converts it to the target type during
the copy. (Converting to ascii_t will throw an exception if a resulting
character won't fit into eight bits.)
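
A minimal sketch of how such converting types might look. This is not
the actual code: only the class names come from the description above,
and the member names, the choice of std::u32string as utf32_t's storage,
and the exception type are assumptions. For brevity it only shows
conversions out of utf32_t.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in: holds a sequence of Unicode code points.
class utf32_t {
public:
    explicit utf32_t(std::u32string s) : data_(std::move(s)) {}
    const std::u32string& operator*() const { return data_; }
private:
    std::u32string data_;
};

// Hypothetical stand-in: holds UTF-8 bytes in a std::string.
class utf8_t {
public:
    // Deliberately not explicit, so copying from utf32_t converts.
    utf8_t(const utf32_t& src) {
        for (char32_t c : *src) append(c);
    }
    const std::string& operator*() const { return data_; }
private:
    void append(char32_t c) {
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            throw std::range_error("not a Unicode scalar value");
        if (c < 0x80) {
            data_ += static_cast<char>(c);
        } else if (c < 0x800) {
            data_ += static_cast<char>(0xC0 | (c >> 6));
            data_ += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            data_ += static_cast<char>(0xE0 | (c >> 12));
            data_ += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            data_ += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            data_ += static_cast<char>(0xF0 | (c >> 18));
            data_ += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            data_ += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            data_ += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    std::string data_;
};

// Hypothetical stand-in: one byte per character.
class ascii_t {
public:
    ascii_t(const utf32_t& src) {
        for (char32_t c : *src) {
            // The eight-bit limit follows the description above.
            if (c > 0xFF)
                throw std::range_error("character does not fit into eight bits");
            data_ += static_cast<char>(c);
        }
    }
    const std::string& operator*() const { return data_; }
private:
    std::string data_;
};
```

With these definitions, `utf8_t narrow = wide;` re-encodes a utf32_t
value, and `ascii_t a = wide;` throws if `wide` holds a character above
U+00FF.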

Each type has an internal storage type as well, based on the character
size (ascii_t and utf8_t use std::string, utf16_t uses 16-bit
characters, etc). You can access the internal storage type using
operator* or operator->. For a utf8_t variable 'v', for example, *v
gives you the UTF-8-encoded string.
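
As an illustration, access to the storage could be sketched like this.
The member name data_ and the helper byte_length are hypothetical; only
operator* and operator-> are taken from the description.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for utf8_t, reduced to its storage accessors.
class utf8_t {
public:
    // *v yields the underlying UTF-8-encoded std::string.
    std::string& operator*() { return data_; }
    const std::string& operator*() const { return data_; }
    // v->member forwards to the underlying std::string.
    std::string* operator->() { return &data_; }
    const std::string* operator->() const { return &data_; }
private:
    std::string data_;
};

// Hypothetical helper: counts encoded bytes, not characters.
std::size_t byte_length(const utf8_t& v) { return v->size(); }
```

So `v->size()` reports the number of bytes in the encoded string, and
`*v` can be handed to any API that expects a std::string.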

A std::string is assumed to be ASCII-encoded. If you really do have
UTF-8-encoded data to get into the system, you either assign it to a
utf8_t through operator* (e.g. *v = data), or use a static function,
utf8_t::precoded. A std::wstring is assumed to be UTF-16- or
UTF-32-encoded already, depending on the underlying character width
for the OS.
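
A rough sketch of the two entry points for UTF-8 data that is already
encoded. The name precoded comes from the description; its signature is
a guess. Rejecting bytes of 0x80 and above in a plain std::string is
also an assumption, since the description does not say how such input
is handled.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for utf8_t.
class utf8_t {
public:
    utf8_t() = default;

    // A plain std::string is taken to be ASCII.
    // Assumption: anything outside 7-bit ASCII is rejected here.
    utf8_t(const std::string& ascii) {
        for (unsigned char c : ascii)
            if (c >= 0x80)
                throw std::invalid_argument("non-ASCII byte in std::string");
        data_ = ascii;
    }

    // Takes bytes that are already UTF-8 and stores them unchanged.
    static utf8_t precoded(std::string already_utf8) {
        utf8_t result;
        result.data_ = std::move(already_utf8);
        return result;
    }

    std::string& operator*() { return data_; }
    const std::string& operator*() const { return data_; }
private:
    std::string data_;
};
```

Both `utf8_t::precoded(bytes)` and `*v = bytes` bypass the ASCII
assumption, so the caller is responsible for the bytes being valid
UTF-8.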

A function is simply declared with parameters of the type that it
needs. You can call it with whichever type you've got, and it will be
auto-converted to the needed type during the call, so for the most part
you can ignore the different types and use whichever one makes the most
sense for your application. I use utf8_t as the main internal string
type for my programs.
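
The call-site conversion can be sketched as follows. The function
utf8_byte_count is hypothetical, and treating bytes 0x80..0xFF of an
ascii_t as code points U+0080..U+00FF is an assumption made to keep the
example short.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for ascii_t.
class ascii_t {
public:
    explicit ascii_t(std::string s) : data_(std::move(s)) {}
    const std::string& operator*() const { return data_; }
private:
    std::string data_;
};

// Hypothetical stand-in for utf8_t.
class utf8_t {
public:
    // Implicit conversion: this is what lets a caller pass an ascii_t.
    utf8_t(const ascii_t& src) {
        for (unsigned char c : *src) {
            if (c < 0x80) {
                data_ += static_cast<char>(c);
            } else {
                // Assumption: the byte is the code point U+0080..U+00FF.
                data_ += static_cast<char>(0xC0 | (c >> 6));
                data_ += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
    }
    const std::string& operator*() const { return data_; }
private:
    std::string data_;
};

// Declared with the type it needs; any convertible type may be passed.
std::size_t utf8_byte_count(const utf8_t& text) { return (*text).size(); }
```

A call such as `utf8_byte_count(ascii_t("abc"))` compiles because the
compiler builds a temporary utf8_t from the argument. The cost is one
conversion and copy per call whenever the types differ.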

For portable OS-interface functions, there's a typedef (os::native_t)
to the type that the OS's API functions need. For Linux-based systems,
it's utf8_t; for Windows, utf16_t. There's also a typedef
(os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but
I'm not sure there's a need for that.
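
One way such typedefs might be selected is with a preprocessor check.
The sketch below uses the common _WIN32 macro; the bare struct
definitions and the function native_units are placeholders, not the
real classes.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Placeholder string types; only the names match the description.
struct utf8_t  { std::string data; };
struct utf16_t { std::u16string data; };
struct utf32_t { std::u32string data; };

namespace os {
#ifdef _WIN32
using native_t  = utf16_t;  // Windows API functions take UTF-16
using unicode_t = utf16_t;
#else
using native_t  = utf8_t;   // Linux API functions take UTF-8
using unicode_t = utf32_t;
#endif
}

// Hypothetical portable function, declared with the OS-native type.
std::size_t native_units(const os::native_t& path) { return path.data.size(); }
```

Code that declares its parameters as os::native_t then compiles against
the right encoding on each platform without further #ifdefs.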

There are some parts of the code that could use polishing, but I like
the overall design, and I'm finding it pretty easy to work with. Anyone
interested in seeing the code?

-- 
Chad Nelson
Oak Circle Software, Inc.