Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Jens Finkhäuser (jens_at_[hidden])
Date: 2011-01-14 12:00:01


On Fri, Jan 14, 2011 at 04:54:05PM +0200, Peter Dimov wrote:
> John B. Turpish wrote:
> - UTF-8 has the nice property that you can do things with a string
> without even decoding the characters; for example, you can sort UTF-8
> strings as-is, or split them on a specific (7 bit) character, such as
> '.' or '/'.

  Please excuse me if I'm stating the obvious, but I feel I should
mention that binary sorting is not collation.

    "The basic principle to remember is: The position of characters in
     the Unicode code charts does not specify their sorting weight."
     -- http://unicode.org/reports/tr10/#Introduction

  Any application that has to present a sorted list of strings to a
user pretty much requires a collation algorithm; in that sense, the
usefulness of being able to sort UTF-8 strings as-is is limited.
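
  To make the difference concrete, here is a minimal sketch of what
sorting UTF-8 strings "as-is" gives you. The names binary_sorted and
byte_less are made up for this illustration, and the non-ASCII string
is written with explicit byte escapes so the example does not depend
on the source file encoding.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Compares two UTF-8 strings byte by byte, treating each byte as
// unsigned. For valid UTF-8 this is the same as code point order.
// (byte_less is a hypothetical helper name, not a Boost or std API.)
bool byte_less(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) <
                   static_cast<unsigned char>(y);
        });
}

// Returns the input sorted in plain binary order, with no collation.
std::vector<std::string> binary_sorted(std::vector<std::string> names) {
    std::sort(names.begin(), names.end(), byte_less);
    return names;
}
```

  Given "apple", "Zebra" and "\xC3\x89mile" (the UTF-8 bytes for
"Émile"), this puts "Zebra" first, because 'Z' is 0x5A and 'a' is
0x61, and the accented name last, because its lead byte is 0xC3. That
is a perfectly consistent order, just not one most users would call
alphabetical.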

  Again, sorry if I'm stating the obvious here. I've had to bring up
that argument in character encoding related discussions more than
once, and it's become a bit of a knee-jerk response by now ;)

  For the application discussed, i.e. for passing strings to OS APIs,
this really doesn't matter, though. Where it does matter slightly is when
deciding whether or not to use UTF-8 internally in your application.

  The UCA maps code points to collation elements, or strings to lists
of collation elements, and then binary sorts those collation element
lists instead of the original strings. My guess would be that using
UCS/UTF-32 for that is likely to be cheaper than using UTF-8, though I
haven't actually run any comparisons here. If anyone has, I'd love to
know.
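
  As a rough sketch of that idea, and nothing more: the code below
maps each code point to a collation element with three weight levels,
builds a sort key from those, and compares the keys in binary order.
The weight table is invented for this example and covers only a few
letters; the real UCA uses the much larger DUCET table and also
handles contractions, expansions and variable weighting, none of which
is attempted here. The names CollationElement, toy_table, sort_key and
collates_before are likewise hypothetical. The input is taken as
std::u32string so that every element is already a code point; with
UTF-8 input a decoding step would have to come before the lookup.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One collation element: primary (base letter), secondary (accent)
// and tertiary (case) weights. All weights here are nonzero.
struct CollationElement {
    std::uint16_t primary;
    std::uint16_t secondary;
    std::uint16_t tertiary;
};

// Invented weights for a handful of code points; NOT the real DUCET.
const std::map<char32_t, CollationElement>& toy_table() {
    static const std::map<char32_t, CollationElement> table = {
        {U'a', {1, 1, 1}}, {U'A', {1, 1, 2}},
        {U'e', {2, 1, 1}}, {U'E', {2, 1, 2}},
        {char32_t(0x00E9), {2, 2, 1}},  // e with acute, lower case
        {char32_t(0x00C9), {2, 2, 2}},  // E with acute, upper case
        {U'z', {3, 1, 1}}, {U'Z', {3, 1, 2}},
    };
    return table;
}

// Builds a sort key: all primary weights, a 0 separator, all secondary
// weights, a 0 separator, then all tertiary weights. Code points that
// are missing from the toy table are simply skipped.
std::vector<std::uint16_t> sort_key(const std::u32string& text) {
    std::vector<std::uint16_t> primary, secondary, tertiary;
    for (char32_t cp : text) {
        auto it = toy_table().find(cp);
        if (it == toy_table().end()) continue;
        primary.push_back(it->second.primary);
        secondary.push_back(it->second.secondary);
        tertiary.push_back(it->second.tertiary);
    }
    std::vector<std::uint16_t> key = primary;
    key.push_back(0);
    key.insert(key.end(), secondary.begin(), secondary.end());
    key.push_back(0);
    key.insert(key.end(), tertiary.begin(), tertiary.end());
    return key;
}

// The binary comparison happens on the sort keys, not on the strings.
bool collates_before(const std::u32string& a, const std::u32string& b) {
    return sort_key(a) < sort_key(b);
}
```

  With these toy weights, U"az" collates before U"Za", and U"\u00E9a"
lands between U"ea" and U"ez", whereas a binary sort of the code
points would put 'Z' before 'a' and U+00E9 after 'z'.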

  All of this is mostly an aside, I guess :)

Jens

-- 
1.21 Jiggabytes of memory should be enough for anybody.


