
Subject: Re: [boost] Thoughts for a GUI (Primitives) Library
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-09-09 10:46:08


On 08/09/10 10:31, Cory Nelson wrote:
> I see two use cases with strings:
> a) You are using your strings so trivially that it doesn't matter that
> they are Unicode or anything else. You're basically just copying
> sequences of bytes around that you got from somewhere else, and
> probably eventually passing them to a renderer which handles (b).
> b) You are processing your strings in a meaningful way, where it
> matters that they are Unicode, and it will be unfortunately complex no
> matter what you do.

In practice, a) on UTF-8 data that has been normalized to NFC works
well enough that it's understandable people don't want to bother with b).
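To illustrate why the NFC precondition matters for a): the same text can
be encoded as different byte sequences, and code that only copies and
compares bytes will treat them as different strings. A minimal sketch
(the function names are made up for this example; the bytes are written
out explicitly so nothing depends on the source file's encoding):

```cpp
#include <cassert>
#include <string>

// "é" in two canonically equivalent forms, as explicit UTF-8 bytes.
inline std::string precomposed() { return "\xC3\xA9"; }   // U+00E9 (NFC form)
inline std::string decomposed()  { return "e\xCC\x81"; }  // U+0065 U+0301 (NFD form)

// Byte-wise comparison, which is all that use case a) ever does.
inline bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}

inline void nfc_example() {
    // Equivalent text, different bytes: they only compare equal
    // once both sides have been normalized to the same form.
    assert(!same_bytes(precomposed(), decomposed()));
    assert(same_bytes(precomposed(), precomposed()));
}
```

If everything entering the system is normalized to NFC first, both
spellings become the first form and plain byte comparison is enough.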

Caring about combining character sequences, grapheme clusters, words,
or any other unit (referred to below as 'whatever') isn't much
more complex, though.
The only thing that could break is substring search, and there are
typically two solutions:
- deal with ranges of whatever instead of ranges of code units
- deal with ranges of code units, then ignore matches that don't lie on
the right whatever boundaries.

Both approaches work, but they have different performance characteristics.
I haven't benchmarked them, but I would expect the second one to be
significantly faster on real data, even if the first one seems like the
more natural solution.

My Unicode library provides both approaches.
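As a rough illustration of the second approach (search on code units,
then reject matches that don't sit on the right boundaries), here is a
sketch for UTF-8 and combining character sequences. It is not the API of
the library mentioned above. All names are hypothetical, and the
combining-mark test is deliberately simplified to the U+0300..U+036F
block; a real implementation would consult the Unicode character
database.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified assumption: only U+0300..U+036F count as combining marks.
// In UTF-8 these are the byte pairs CC 80..CC BF and CD 80..CD AF.
inline bool starts_with_combining_mark(const std::string& s, std::size_t i) {
    if (i + 1 >= s.size()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    return (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
}

// True if byte offset i is a combining-character-sequence boundary:
// the start or end of the text, or the start of a code point that is
// not a combining mark.
inline bool is_sequence_boundary(const std::string& s, std::size_t i) {
    if (i == 0 || i >= s.size()) return true;
    const bool code_point_start =
        (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
    return code_point_start && !starts_with_combining_mark(s, i);
}

// Plain code-unit search, skipping matches whose ends are not boundaries.
inline std::size_t find_on_boundaries(const std::string& haystack,
                                      const std::string& needle) {
    if (needle.empty()) return std::string::npos;
    std::size_t pos = 0;
    while ((pos = haystack.find(needle, pos)) != std::string::npos) {
        if (is_sequence_boundary(haystack, pos) &&
            is_sequence_boundary(haystack, pos + needle.size()))
            return pos;
        ++pos;  // rejected: resume the search one code unit further
    }
    return std::string::npos;
}

inline void search_example() {
    // "cafe" + U+0301, a space, then plain "cafe".
    const std::string text = "cafe\xCC\x81 cafe";
    assert(text.find("cafe") == 0);                // raw search splits "e" from its accent
    assert(find_on_boundaries(text, "cafe") == 7); // filtered search skips that match
}
```

The fast path is still an ordinary byte search; the boundary check only
runs on candidate matches, which is why one would expect it to do well
on real data.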

> So I'd make the argument that it _does not matter_ what encoding is
> used. Make an arbitrary choice!

One slight advantage of UTF-16/UTF-32 is that you could possibly use
wide string literals to portably input data, if you've got your compiler
set up correctly, whereas you cannot do that with regular string literals
unless your locale is also UTF-8 (which isn't possible on Windows).

I guess the best choice to make people happy is to allow any range for
input, deduce the encoding from its value_type, and return
UTF-8 ranges for output.
Input/output here refers to what you give the API and what the API gives
back to you.
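A minimal sketch of what deducing the encoding from the value_type could
look like, assuming the code unit size alone decides (1 byte is UTF-8,
2 bytes is UTF-16, 4 bytes is UTF-32). The names encoding, encoding_of,
code_unit_of and deduce_encoding are invented for this example and do not
come from any existing library; converting the output to UTF-8 is left
out.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

enum class encoding { utf8, utf16, utf32 };

// Assumption: the size of the code unit type determines the encoding.
template <typename CodeUnit>
constexpr encoding encoding_of() {
    static_assert(sizeof(CodeUnit) == 1 || sizeof(CodeUnit) == 2 ||
                  sizeof(CodeUnit) == 4,
                  "unsupported code unit size");
    return sizeof(CodeUnit) == 1 ? encoding::utf8
         : sizeof(CodeUnit) == 2 ? encoding::utf16
                                 : encoding::utf32;
}

// value_type of any range usable with std::begin (containers, arrays).
template <typename Range>
using code_unit_of = typename std::iterator_traits<
    decltype(std::begin(std::declval<const Range&>()))>::value_type;

template <typename Range>
constexpr encoding deduce_encoding(const Range&) {
    return encoding_of<code_unit_of<Range>>();
}

inline void deduce_example() {
    assert(deduce_encoding(std::string("abc")) == encoding::utf8);
    assert(deduce_encoding(std::u16string()) == encoding::utf16);
    assert(deduce_encoding(std::vector<char32_t>()) == encoding::utf32);
}
```

Under this scheme a std::wstring is treated as UTF-16 where wchar_t is
2 bytes (Windows) and as UTF-32 where it is 4 bytes (most Unix-like
systems), which matches how wide string literals are typically encoded
on those platforms.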

