Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-18 14:01:10


> > Otherwise you should:
> >
> > 1. Reinvent the string
>
> Or at least wrap it. ;-)
>
> > 2. Reinvent standard library to use new string
>
> Not entirely necessary, for the same reason that very few changes to
> the standard library are needed when you switch from char strings to
> char16_t strings to char32_t strings -- the standard library, designed
> around the idea of iterators, is mostly type-agnostic.
>

OK... A few things:

1. UTF-32 is a waste of space - don't use it unless you are doing
   something like handling individual code points (char32_t).
2. UTF-16 is too error-prone (see: "UTF-16 considered harmful").
3. There is no special char8_t type distinct from char, so you
   can't use it.

> The utf*_t types provide fully functional iterators,

OK, let's think: what do you need iterators for? Accessing "characters"?
If so, you are most likely doing something terribly wrong, as you ignore
the fact that code point != character.

I would say such an iterator is wrong by design unless you are developing
a Unicode algorithm that operates on code points.
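
To make the distinction concrete, here is a minimal sketch. The helper
name count_code_points is made up for this example, and it assumes the
input is already valid UTF-8. It counts code points the way a code point
iterator would see them:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes (those of the form 10xxxxxx).
// Assumes valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}
```

For "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 0x65 0xCC 0x81)
this gives 2 code points in 3 bytes, while the user sees one character.
Iterating over code points therefore still does not iterate over
characters.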

> so they'll work
> fine with most library functions, so long as those functions don't care
> that some characters are encoded as multiple bytes. It's just the ones
> that assume that a single byte represents all characters that you have
> to replace, and you'd have to replace those regardless of whether you're
> using a new string type or not, if you're using any multi-byte encoding.
>

Ok...

The paragraph above is inherently wrong.

First of all, let's clear things up:

> that some characters are encoded as multiple bytes

Characters are not code points.

> the ones that assume that a single byte represents
> all characters

Please, let me make this statement even clearer:

 C H A R A C T E R != C O D E P O I N T

This holds even in single-byte encodings: for example, windows-1255 is a
single-byte encoding and may still represent a single character using 1, 2
or 3 bytes (a base letter plus combining marks)!

Once again - when you work with strings you don't work with them as a series
of characters; you work with them as text entities - text chunks.

>
> and you'd have to replace those regardless of whether you're
> using a new string type or not, if you're using any multi-byte encoding.
>

No, I would not, because I don't look at a string as a sequence
of code points - by themselves they are meaningless.

Code points are meaningful only in terms of the Unicode algorithms
that know how to combine them.

So if you want to handle text chunks you will have to use
some Unicode-aware library.

>
> > It is just neither feasible nor necessary.
>
> My code says it's perfectly feasible. ;-) Whether it's necessary or not
> is up to the individual developer, but the type-safety it offers is
> more in line with the design philosophy of C++ than using std::string
> for everything. I hate to harp on the same tired example, but why do
> you really need any pointer type other than void*? It's the same idea.
>

No, it isn't. A string is a text chunk.

You can combine them, concatenate them, search for specific
substrings, or look for ASCII characters, for example when
parsing HTML, and this is perfectly doable with the
standard std::string regardless of whether it is UTF-8, Latin1 or
another ASCII-compatible ISO-8859-* encoding.
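
As a rough illustration of that point, here is a minimal sketch. The
helpers tag_names and first_text are hypothetical names for this example,
not a real HTML parser. It searches only for the ASCII delimiters '<' and
'>' and treats everything else as opaque bytes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the text between each '<' and the next '>'.
// Only ASCII delimiters are inspected; other bytes are passed through.
std::vector<std::string> tag_names(const std::string& html) {
    std::vector<std::string> tags;
    std::size_t pos = 0;
    while ((pos = html.find('<', pos)) != std::string::npos) {
        const std::size_t end = html.find('>', pos);
        if (end == std::string::npos) break;  // unterminated tag: stop
        tags.push_back(html.substr(pos + 1, end - pos - 1));
        pos = end + 1;
    }
    return tags;
}

// Hypothetical helper: returns the bytes between the first '>' and the
// following '<', whatever their (ASCII-compatible) encoding is.
std::string first_text(const std::string& html) {
    const std::size_t begin = html.find('>');
    if (begin == std::string::npos) return "";
    const std::size_t end = html.find('<', begin);
    if (end == std::string::npos) return html.substr(begin + 1);
    return html.substr(begin + 1, end - begin - 1);
}
```

The same code handles "<p>...</p>" whether the content is UTF-8 or
Latin1, because in these encodings bytes below 0x80 always mean the ASCII
character and never appear inside a multi-byte sequence.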

This is very different.

Giving you a "utf-8" string or a UTF-8 container would
give you a false feeling that you are doing something right.

Unicode is not about splitting a string into code points
or iterating over them... It is a totally different thing.

Artyom