Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Jens Finkhäuser (jens_at_[hidden])
Date: 2011-01-14 12:00:01

Next message: Bryce Lelbach: "Re: [boost] the <boost/detail/iomanip.hpp> header (please use it)"
Previous message: Jeff Flinn: "Re: [boost] [Process] List of small issues"
In reply to: Peter Dimov: "Re: [boost] [General] Always treat std::strings as UTF-8"
Next in thread: Peter Dimov: "Re: [boost] [General] Always treat std::strings as UTF-8"
Reply: Peter Dimov: "Re: [boost] [General] Always treat std::strings as UTF-8"

On Fri, Jan 14, 2011 at 04:54:05PM +0200, Peter Dimov wrote:
> John B. Turpish wrote:
> - UTF-8 has the nice property that you can do things with a string
> without even decoding the characters; for example, you can sort UTF-8
> strings as-is, or split them on a specific (7 bit) character, such as
> '.' or '/'.
Please excuse me if I'm stating the obvious, but I feel I should
mention that binary sorting is not collation.

    "The basic principle to remember is: The position of characters in
     the Unicode code charts does not specify their sorting weight."
     -- http://unicode.org/reports/tr10/#Introduction

Any application that requires you to present a sorted list of
strings to a user pretty much requires a collation algorithm; in that
sense, the usefulness of the above mentioned property of UTF-8 is
limited.
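To make the difference concrete, here is a minimal sketch of what binary sorting does to a small list of UTF-8 strings. The function name binary_sort and the sample words are my own, purely for illustration. It relies only on std::string comparing bytes as unsigned values, which for valid UTF-8 is the same as code point order.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sorts UTF-8 strings by raw byte value. std::string's operator< compares
// bytes as unsigned char, so for valid UTF-8 this is code point order.
// It is NOT collation: no language-specific or UCA weights are involved.
// (binary_sort is a hypothetical helper name, not a Boost or std function.)
inline std::vector<std::string> binary_sort(std::vector<std::string> strings) {
    std::sort(strings.begin(), strings.end());
    return strings;
}

// Sample input. The literal is split after "\xC3\xA9" so that the 'c' of
// "clair" is not swallowed by the hex escape.
inline std::vector<std::string> sample_words() {
    return {"\xC3\xA9" "clair",  // "éclair" encoded as UTF-8
            "apple",
            "Zebra"};
}
```

With this input, binary_sort(sample_words()) yields "Zebra", "apple", "éclair": 'Z' (0x5A) precedes 'a' (0x61), which precedes the lead byte 0xC3. A user reading an English or French word list would more likely expect apple, éclair, Zebra, which is what a collation algorithm is for.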

Again, sorry if I'm stating the obvious here. I've had to bring up
that argument in character encoding related discussions more than
once, and it's become a bit of a knee-jerk response by now ;)

For the application discussed, i.e. for passing strings to OS APIs,
this really doesn't matter, though. Where it does matter slightly is when
deciding whether or not to use UTF-8 internally in your application.

The UCA maps code points to collation elements, or strings into lists
of collation elements, and then binary sorts those collation element
lists instead of the original strings. My guess would be that using
UCS/UTF-32 for that is likely to be cheaper, though I haven't actually
run any comparisons here. If anyone has, I'd love to know.
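As a rough sketch of that idea, assuming UTF-32 input and a made-up weight table: each code point is mapped to a single primary weight, and the resulting weight lists are compared lexicographically. Everything here (toy_primary_weight, make_sort_key, toy_collate, and the weights themselves) is hypothetical. The real UCA uses the DUCET table, several weight levels, normalization, and contractions, none of which are modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// A sort key is a list of collation weights, one per code point in this toy.
using SortKey = std::vector<std::uint16_t>;

// Hypothetical primary weights; these are NOT real DUCET values.
// Case and accents are ignored, so 'e', 'E' and U+00E9 share a weight.
inline std::uint16_t toy_primary_weight(char32_t cp) {
    switch (cp) {
        case U'a': case U'A': return 1;
        case U'e': case U'E': case U'\u00E9': return 2;
        case U'z': case U'Z': return 3;
        default: return 0xFFFF;  // unknown code points sort last
    }
}

// Maps a UTF-32 string to its list of toy collation weights.
inline SortKey make_sort_key(const std::u32string& s) {
    SortKey key;
    key.reserve(s.size());
    for (char32_t cp : s) key.push_back(toy_primary_weight(cp));
    return key;
}

// Sorts strings by comparing their sort keys instead of the strings.
inline std::vector<std::u32string> toy_collate(
        const std::vector<std::u32string>& input) {
    typedef std::pair<SortKey, std::u32string> Keyed;
    std::vector<Keyed> keyed;
    for (const std::u32string& s : input)
        keyed.push_back(Keyed(make_sort_key(s), s));
    // std::vector's operator< is lexicographic, i.e. a binary sort of keys.
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const Keyed& a, const Keyed& b) {
                         return a.first < b.first;
                     });
    std::vector<std::u32string> result;
    for (const Keyed& k : keyed) result.push_back(k.second);
    return result;
}
```

With UTF-32 each array element is one code point, so the weight lookup needs no decoding step. With UTF-8 input, the loop in make_sort_key would have to decode multi-byte sequences first. Whether that difference is measurable is exactly the part I have not benchmarked.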

All of this is mostly an aside, I guess :)

Jens

-- 
1.21 Jiggabytes of memory should be enough for anybody.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk