Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-18 20:16:37

On Tue, 18 Jan 2011 11:01:10 -0800 (PST)
Artyom <artyomtnk_at_[hidden]> wrote:

>>> 2. Reinvent standard library to use new string
>> Not entirely necessary, for the same reason that very few changes to
>> the standard library are needed when you switch from char strings to
>> char16_t strings to char32_t strings -- the standard library,
>> designed around the idea of iterators, is mostly type-agnostic.
> Ok... A few things:
> 1. UTF-32 is a waste of space - don't use it unless you are doing
> something like handling code points (char32_t)
> 2. UTF-16 is too error-prone (see: UTF-16 considered harmful)

No argument with either assertion.

> 3. There is no special type char8_t distinct from char, so you
> can't use it.

That's why I wrote the utf8_t type. I'd have been quite happy to just
use an std::basic_string<utf8_byte_t>, and I looked into the C++0x
"opaque typedef" idea to see if it was possible. I couldn't find any
elegant way to make it work, and the opaque typedef proposal was
dropped from the spec, so I felt that I had to write the utf8_t class.

However, I'm not sure what point you're trying to make with the above.

>> The utf*_t types provide fully functional iterators,
> Ok, let's think about what you need iterators for. Accessing
> "characters"? If so, you are most likely doing something terribly
> wrong, as you ignore the fact that code point != character.

In the current incarnation of the class, the iterators are for
accessing the bytes, to make it trivially compatible with things like

> I would say such iterator is wrong by design unless you develop
> a Unicode algorithm that relates to code point.

If that's needed (and it probably is), it's easy enough to add. It
just wouldn't use the standard begin() and end() functions.
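
As a rough sketch of what such code-point access could look like (this
is not code from the utf8_t class; the names next_code_point and
code_points are hypothetical, and the input is assumed to be valid
UTF-8 with no error handling):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the utf8_t class discussed here.
// Decodes the code point starting at s[i] and advances i past it.
// Assumes s holds valid UTF-8; malformed input is not detected.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                  // 1-byte sequence (ASCII)
    // Number of continuation bytes implied by the lead byte.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);          // payload bits of the lead byte
    while (extra-- > 0) {
        assert(i < s.size());
        // Each continuation byte contributes its low 6 bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Collects every code point in s, in order.
std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) out.push_back(next_code_point(s, i));
    return out;
}
```

Each value returned is a code point, not necessarily a user-perceived
character; combining sequences would still need a Unicode-aware layer
on top of this.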

>> so they'll work
>> fine with most library functions, so long as those functions don't
>> care that some characters are encoded as multiple bytes. It's just
>> the ones that assume that a single byte represents all characters
>> that you have to replace, and you'd have to replace those
>> regardless of whether you're using a new string type or not, if
>> you're using any multi-byte encoding.
> Ok...
> The paragraph above is inherently wrong


> First of all, let's clean things up:
>> that some characters are encoded as multiple bytes
> Characters are not code points.

A semantic point. Correct, but irrelevant to the argument I was trying
to make.

>> the ones that assume that a single byte represents
>> all characters
> Please, I want to make this statement even clearer:
> C H A R A C T E R != C O D E P O I N T
> Even in single-byte encodings - for example, windows-1255 is a
> single-byte encoding and may still represent a single character using
> 1, 2 or 3 bytes!

std::copy, std::mismatch, std::equal, std::search, and several others
would work equally well on UTF-8 strings. Functions that only allow you
to specify a single element to work with, like std::find, would require
a slightly different kind of iterator, one that operated on either
characters or code-points. I don't see how that makes anything in the
quoted paragraph inherently wrong.
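
To illustrate the range-based case, here is a minimal sketch, assuming
both strings hold valid UTF-8 (find_utf8 is a hypothetical helper
name). std::search compares bytes, and because UTF-8 lead bytes and
continuation bytes never overlap, a byte-wise match of a valid needle
always starts on a code-point boundary:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset of the first occurrence
// of needle in haystack, or std::string::npos if it is absent.
// Both arguments are treated as raw sequences of UTF-8 bytes.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    assert(!needle.empty());
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end());
    return it == haystack.end()
               ? std::string::npos
               : static_cast<std::size_t>(it - haystack.begin());
}
```

By contrast, std::find takes a single element, so on a byte iterator it
can only look for one byte, which is not enough to identify a
multi-byte code point.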

> Once again - when you work with strings you don't work with them as
> a series of characters; you work with them as text entities - text
> chunks.

That depends on what you're doing with them. If you're using them as
translations for messages your program is sending out, then your
statement is correct -- you treat them as opaque blobs. But if for
instance you're parsing a file, you want tokens, which *are* merely an
arbitrary series of characters. Or if your program allows the user to
edit a file, you want something that gives you single characters,
regardless of how many bytes or code-points they're encoded in.

>> and you'd have to replace those regardless of whether you're
>> using a new string type or not, if you're using any multi-byte
>> encoding.
> No, I would not, because I don't look at a string as a sequence of
> code points - by themselves they are meaningless.
> Code points are meaningful in terms of Unicode algorithms that know
> how to combine them.
> So if you want to handle text chunks you will have to use some
> Unicode-aware library.

If you want to sort them properly for the locale you're working with,
you're correct. If you just want to write them out, or edit them, then
barring things like messages in mixed left-to-right and right-to-left
languages, it's fairly simple.

>>> It is just neither feasible nor necessary.
>> My code says it's perfectly feasible. ;-) Whether it's necessary or
>> not is up to the individual developer, but the type-safety it
>> offers is more in line with the design philosophy of C++ than using
>> std::string for everything. I hate to harp on the same tired
>> example, but why do you really need any pointer type other than
>> void*? It's the same idea.
> No it isn't. A string is a text chunk.
> You can combine them, concatenate them, search for specific
> substrings or relate to ASCII characters, for example as in HTML, and
> parse them, and this is perfectly doable within the standard
> std::string regardless of whether it is UTF-8, Latin1 or another
> ISO-8859-* ASCII-compatible encoding.
> This is very different.

I'm trying to understand your point, but with no success so far. If you
want something that gives you characters or code-points, then an
std::string has no chance of working in any multi-byte encoding -- a
UTF-whatever-specific type does.
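
A small sketch of the mismatch, assuming valid UTF-8 input
(code_point_count is a hypothetical helper, not part of std::string or
the utf8_t class): size() reports bytes, so the code-point count has to
be computed separately, here by skipping continuation bytes of the form
10xxxxxx.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a valid UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t code_point_count(const std::string& s) {
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), [](char c) {
            return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
        }));
}
```

For the five-byte string "caf\xC3\xA9", size() is 5 while
code_point_count gives 4, and operator[] hands back individual bytes
rather than code points.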

> Giving you a "utf-8" string or UTF-8 container would give you a false
> feeling that you are doing something right.


> Unicode is not about splitting strings into code points or iterating
> over them... It is a totally different thing.

I'm baffled by this statement. For doing anything interesting, Unicode
or any other encoding *is* about iterating over characters (or
code-points, if that's what you're looking for).

Your point seems to be that the utf*_t classes are actively harmful in
some way that I don't see, and using std::string somehow mitigates that
by making you do more work. Or am I misunderstanding you?

Chad Nelson
Oak Circle Software, Inc.
