Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-16 21:41:25


On Sun, 16 Jan 2011 12:56:23 -0800 (PST)
Artyom <artyomtnk_at_[hidden]> wrote:

>> The system I'm now using for my programs might interest you.
>>
>> I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t.
>> Assigning one type to another automatically converts it to the
>> target type during the copy. (Converting to ascii_t will throw an
>> exception if a resulting character won't fit into eight bits.)
>>
>
> If so (and this is what I see in code) ASCII is misleading.
> It should be called Latin1/ISO-8859-1 but not ASCII.

Probably, but latin1_t isn't very obvious, and iso_8859_1_t is a little
awkward to type. ;-) As I've said, this code was written solely for my
company; I'd make a number of changes if I were going to submit it to
Boost.

>> An std::string is assumed to be ASCII-encoded. If you really do
>> have UTF-8-encoded data to get into the system, you either assign it
>> to a utf8_t using operator*, or use a static function
>> utf8_t::precoded. std::wstring is assumed to be utf16_t- or
>> utf32_t-encoded already, depending on the underlying character
>> width for the OS.
>
> This is a very bad assumption. To be honest, I've written lots of code
> with direct UTF-8 strings in it (Boost.Locale tests) and this worked
> perfectly well with MSVC, GCC and Intel compilers (as long as I work
> with char * not L""), and this works fine all the time.
>
> It is a bad assumption; the string should be treated as a byte string
> which may be UTF-8 or may not be.

But if you assigned that byte string to a utf*_t type, how would you
treat it? I had to either make some assumption, or disallow assigning
from an std::string and char* entirely. And it's just too convenient to
use those assignments, for things like constants, to give that up.

The way I designed it, you're supposed to feed it only ASCII (or
Latin-1, if you prefer) text when you make an assignment that way. If
you have some differently-coded text, you'd feed it in through another
class, one that knows its coding and is designed to decode to UTF-32
the way that utf8_t and utf16_t are, so that the templated conversion
functions know how to handle it.
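
A minimal sketch of that arrangement, assuming hypothetical names
(utf32_text, latin1_text, utf8_text, convert) rather than the actual
classes, which aren't shown in this thread. Each type knows how to decode
itself to UTF-32 and how to encode from it, and one templated function
routes every conversion through that common form. Decoding UTF-8 is left
out to keep the sketch short.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the classes discussed above.
struct utf32_text {
    std::u32string data;
    utf32_text decode() const { return *this; }
    static utf32_text encode(const utf32_text& in) { return in; }
};

struct latin1_text {
    std::string bytes;
    // Each Latin-1 byte is the code point with the same numeric value.
    utf32_text decode() const {
        utf32_text out;
        for (unsigned char c : bytes)
            out.data.push_back(static_cast<char32_t>(c));
        return out;
    }
    // Throws if a code point does not fit into eight bits.
    static latin1_text encode(const utf32_text& in) {
        latin1_text out;
        for (char32_t cp : in.data) {
            if (cp > 0xFF)
                throw std::range_error("not representable in Latin-1");
            out.bytes.push_back(static_cast<char>(cp));
        }
        return out;
    }
};

struct utf8_text {
    std::string bytes;
    // Standard UTF-8 encoding of Unicode scalar values.
    static utf8_text encode(const utf32_text& in) {
        utf8_text out;
        for (char32_t cp : in.data) {
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                throw std::range_error("not a Unicode scalar value");
            if (cp < 0x80) {
                out.bytes.push_back(static_cast<char>(cp));
            } else if (cp < 0x800) {
                out.bytes.push_back(static_cast<char>(0xC0 | (cp >> 6)));
                out.bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            } else if (cp < 0x10000) {
                out.bytes.push_back(static_cast<char>(0xE0 | (cp >> 12)));
                out.bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
                out.bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            } else {
                out.bytes.push_back(static_cast<char>(0xF0 | (cp >> 18)));
                out.bytes.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
                out.bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
                out.bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            }
        }
        return out;
    }
};

// Every conversion goes through UTF-32, so a new encoding only needs
// its own decode() and encode() to work with all the others.
template <class To, class From>
To convert(const From& from) {
    return To::encode(from.decode());
}
```

With this, convert<utf8_text>(latin1_text{"caf\xE9"}) yields the bytes
"caf\xC3\xA9", and converting a euro sign (U+20AC) to latin1_text throws,
matching the eight-bit rule described earlier.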

> There are two cases we need to treat strings and encoding:
>
> 1. We handle human language or text - collation, formatting etc.
> 2. We want to access Windows Wide API that is not locale agnostic.

I'm not sure where you're coming from. Those are two broad categories
of uses for that code, but arguably not the only two.

>> For portable OS-interface functions, there's a typedef
>> (os::native_t) to the type that the OS's API functions need. For
>> Linux-based systems, it's utf8_t; for Windows, utf16_t. There's
>> also a typedef (os::unicode_t) that is utf32_t on Linux and utf16_t
>> on Windows, but I'm not sure there's a need for that.
>>
>
> When you work with Linux and Unix at all you should not change
> encoding. There were discussions about it. [...] Using your API it
> would fail as it holds some assumptions on encoding.

Why would you feed "\xFF\xFF.txt" into a utf8_t type, if you didn't
want it encoded as UTF-8? If you have a function that requires some
different encoding, you'd use that encoding instead. For filenames,
you'd treat the strings entered by the user or obtained from the file
system as opaque blocks of bytes.

In any case, all modern Linux OSes use UTF-8 by default, so I haven't
seen any need to worry about other forms yet. I'm not even sure how I'd
tell what code page a Linux system is set to use; so far I've never
needed to know that. Though if a Russian customer comes along and tells
me my code doesn't work right on his Linux system, I'll re-think that.
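
For reference, one way to ask a POSIX system which encoding the current
locale uses is nl_langinfo(CODESET). A small sketch follows; the helper
name locale_codeset is made up for illustration, and the call only
reflects the locale environment settings, not the encoding of any
particular filename or file.

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX only; not available on Windows
#include <string>

// Hypothetical helper: returns the codeset name of the user's locale.
// Note that setlocale() changes the character-type locale for the
// whole process.
std::string locale_codeset() {
    std::setlocale(LC_CTYPE, "");   // adopt the locale from the environment
    return nl_langinfo(CODESET);    // codeset name, e.g. "UTF-8"
}
```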

>> There are some parts of the code that could use polishing, but I
>> like the overall design, and I'm finding it pretty easy to work
>> with. Anyone interested in seeing the code?
>
> IMHO, I don't think that inventing new strings or new text containers
> is the way to go. std::string is perfectly fine as long as you code in
> a consistent way.

I have to respectfully disagree. std::string says nothing about the
encoding of the data within it. If you're using more than one type of
encoding in your program, like Latin-1 and UTF-8, then using
std::strings is like using void pointers -- no type safety, no way to
automate conversions when necessary, and no way to select overloaded
functions based on the encoding. A C++ solution pretty much requires
that they be unique types.
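
A short sketch of the overloading point, using hypothetical wrapper
types (latin1_text, utf8_text) and a made-up function name (code_points).
The compiler picks the right counting rule from the type, which a bare
std::string can't offer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrappers: same storage, different types.
struct latin1_text { std::string bytes; };
struct utf8_text   { std::string bytes; };

// Latin-1: one byte per code point.
std::size_t code_points(const latin1_text& t) {
    return t.bytes.size();
}

// UTF-8: count every byte that is not a continuation byte (10xxxxxx).
// Assumes well-formed input.
std::size_t code_points(const utf8_text& t) {
    std::size_t n = 0;
    for (unsigned char c : t.bytes)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// code_points(std::string("...")) does not compile: the caller has to
// say which encoding the bytes are in.
```

Both code_points(latin1_text{"caf\xE9"}) and
code_points(utf8_text{"caf\xC3\xA9"}) return 4, even though the UTF-8
version holds five bytes.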

-- 
Chad Nelson
Oak Circle Software, Inc.