Subject: Re: [boost] Thoughts for a GUI (Primitives) Library
From: Cory Nelson (phrosty_at_[hidden])
Date: 2010-09-08 05:31:30


On Tue, Sep 7, 2010 at 11:55 PM, Yakov Galka <ybungalobill_at_[hidden]> wrote:
> On Wed, Sep 8, 2010 at 06:28, Artyom <artyomtnk_at_[hidden]> wrote:
>> 2. What strings should be used? std::string, std::wstring, custom string
>>   like Qt's QString or GTKmm's ustring?
>>
> As a windows programmer I say: use UTF-8 with std::string. See Pavel's
> answer here:
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful.
> Why UTF-8? In non gui code std::string is more common. Single byte encoding
> is also default for std::exception::what() for example.
> Anyway if there is no consensus on this topic you still can use both through
> a configuration (typedef std::string/std::wstring tstring;).

UTF-8 is variable-length encoded. Some people find that inconvenient,
because finding a character means searching for a multi-byte substring
rather than a single code unit.
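
To illustrate the point about UTF-8, here is a minimal sketch (the
function name utf8_code_points is made up for this example, and it
assumes the input is already well-formed UTF-8) showing that the byte
length of a std::string and its code point count can differ:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input;
// it does no validation.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}
```

For the two-byte sequence "\xC3\xA9" (U+00E9), size() reports 2 while
this helper reports 1.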

UTF-16 is variable-length encoded. Some people like to treat it as
fixed-length, and they are doing it wrong. The Windows API was
originally built with UCS-2 in mind -- it only later got UTF-16
support added. I often wonder if they would have gone that route had
they known UCS-2 would soon run out of code points.
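
As a rough sketch of why UTF-16 cannot be treated as fixed-length, the
following decodes one code point and combines a surrogate pair when it
finds one. The name utf16_decode_at is hypothetical, std::u16string
assumes a C++11 compiler, and unpaired surrogates are simply returned
as-is rather than reported as errors:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point that starts at index i.
// Code points above U+FFFF occupy two 16-bit code units (a high
// surrogate followed by a low surrogate). Assumes i < s.size().
char32_t utf16_decode_at(const std::u16string& s, std::size_t i) {
    char32_t hi = s[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.size()) {
        char32_t lo = s[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;  // BMP code point, or an unpaired surrogate left untouched
}
```

U+1F600, for instance, is stored as the two code units 0xD83D 0xDE00,
so code that indexes by code unit sees two "characters" there.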

UTF-32 is fixed-length, but it wastes a lot of space for most cultures.

If you are processing Unicode correctly, then you will most likely
need to deal with grapheme clusters (the visible individual characters
that get rendered to your screen). Grapheme clusters can be made of
multiple code points, and there are even different combinations of
code points that create the same grapheme cluster. So for any real
Unicode processing, the encoding does not matter because it is always
effectively variable-length and often not usable with simple
Unicode-ignorant functions like strstr().
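
A small sketch of the strstr() problem, using two UTF-8 spellings of the
same visible word. The variable and function names are made up for this
example; a real comparison would normalize both strings first (for
example with ICU), which is not shown here:

```cpp
#include <cstring>
#include <string>

// "café" with é as the single code point U+00E9.
const std::string precomposed = "caf\xC3\xA9";
// "café" with é as "e" followed by U+0301 COMBINING ACUTE ACCENT.
const std::string decomposed = "cafe\xCC\x81";

// Hypothetical helper: byte-wise substring search, unaware of Unicode.
bool naive_contains(const std::string& haystack, const std::string& needle) {
    return std::strstr(haystack.c_str(), needle.c_str()) != nullptr;
}
```

Both strings render as the same four grapheme clusters, yet
naive_contains(decomposed, precomposed) is false, while
naive_contains(decomposed, "cafe") is true even though the rendered
text has no unaccented "e".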

I see two use cases with strings:
a) You are using your strings so trivially that it doesn't matter
whether they are Unicode or anything else. You're basically just copying
sequences of bytes around that you got from somewhere else, and
probably eventually passing them to a renderer which handles (b).
b) You are processing your strings in a meaningful way, where it
matters that they are Unicode, and it will be unfortunately complex no
matter what you do.

So I'd make the argument that it _does not matter_ what encoding is
used. Make an arbitrary choice!

-- 
Cory Nelson
http://int64.org
