Subject: Re: [boost] [rfc] Unicode GSoC project
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2009-05-13 18:35:15


Phil Endecott wrote:

> I would be interested to see this code. I encourage you to share what
> you have done as soon as possible, to get prompt feedback.

I have some code on the boost sandbox svn, but it doesn't implement the
documentation I gave.

It's just some fairly heavyweight iterator adapters with sanity checks,
built on top of a general iterator concept that I probably need to
refine for efficiency.

> Some feedback based on that document:
>
> UTF-16
> ....
> This is the recommended encoding for dealing with Unicode.
>
> Recommended by who? It's not the encoding that I would normally recommend.

Unicode Technical Note #12 ("UTF-16 for Processing"):
http://www.unicode.org/notes/tn12/
It recommends the use of UTF-16 for general purpose text processing.

It also states that UTF-8 is good for compatibility and data exchange,
and that UTF-32 uses too much memory and is thus quite a waste.
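
To put rough numbers on the memory argument, here is a small sketch that counts how many bytes a sequence of code points needs in each encoding form. The names (utf8_bytes, utf16_bytes, utf32_bytes, encoded_sizes) are made up for illustration and are not part of the sandbox code or the proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to encode one code point in UTF-8 (1 to 4 code units of 1 byte).
inline std::size_t utf8_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Bytes needed in UTF-16: one 2-byte unit in the BMP, a surrogate pair above it.
inline std::size_t utf16_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

// Bytes needed in UTF-32: always one 4-byte unit.
inline std::size_t utf32_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return 4;
}

// Hypothetical aggregate: total storage for a whole text in each form.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

inline EncodedSizes encoded_sizes(const std::vector<std::uint32_t>& text) {
    EncodedSizes total = {0, 0, 0};
    for (std::size_t i = 0; i < text.size(); ++i) {
        total.utf8 += utf8_bytes(text[i]);
        total.utf16 += utf16_bytes(text[i]);
        total.utf32 += utf32_bytes(text[i]);
    }
    return total;
}
```

For mostly-ASCII text UTF-8 comes out smallest, for most other BMP text UTF-16 does, and UTF-32 is never smaller than either.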

>
> make_utf8(Range&& range);
> Assumes range range is a properly encoded UTF-8 range in
> Normalization Form C.
> Iterating the range may throw an exception if it isn't.
>
> as_utf8(Range&& range);
> Return type is a model of UnicodeRange whose value type is uchar8_t.
>
> To me, the word "make" suggests that the former is actually doing a
> conversion. But it's the latter, "as", that does that. Can we think of
> something better? (Can anyone suggest any precedents?)

I kind of named it randomly.
I also thought of verify_utf8, but wouldn't this be a better name for a
function that eagerly checks the range is valid?

I see three options here:
1) We assume the range is valid and don't bother checking anything
2) We assume the range is valid but still do sanity checks as we go to
avoid invoking undefined behaviour.
3) We check the whole range, and know we can use it without any checks
afterwards
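
To make the three options concrete, here is a rough sketch over a plain std::string rather than a generic range. All names (utf8_length, byte_at, decode_unchecked, decode_checked, verify_utf8) are hypothetical and not taken from the sandbox code. It only looks at UTF-8 well-formedness, not at Normalization Form C, and option 2 is written with just the checks it needs to stay within bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sequence length implied by a lead byte, or 0 if the byte can never
// start a well-formed UTF-8 sequence.
inline std::size_t utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

inline unsigned char byte_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Option 1: trust the input. The asserts only exist in debug builds.
inline std::uint32_t decode_unchecked(const std::string& s, std::size_t& i) {
    static const unsigned char lead_mask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    assert(i < s.size());
    const unsigned char lead = byte_at(s, i);
    const std::size_t n = utf8_length(lead);
    assert(n != 0 && n <= s.size() - i);
    std::uint32_t cp = lead & lead_mask[n];
    for (std::size_t k = 1; k < n; ++k)
        cp = (cp << 6) | (byte_at(s, i + k) & 0x3F);
    i += n;  // advance past the decoded sequence
    return cp;
}

// Option 2: check as we go, only as much as needed to never read out of
// bounds or decode from a non-continuation byte. It does not reject
// every ill-formed sequence (encoded surrogates still get through).
inline std::uint32_t decode_checked(const std::string& s, std::size_t& i) {
    if (i >= s.size()) throw std::out_of_range("decode past end of range");
    const std::size_t n = utf8_length(byte_at(s, i));
    if (n == 0 || n > s.size() - i)
        throw std::runtime_error("malformed UTF-8: bad or truncated lead byte");
    for (std::size_t k = 1; k < n; ++k)
        if ((byte_at(s, i + k) & 0xC0) != 0x80)
            throw std::runtime_error("malformed UTF-8: bad continuation byte");
    return decode_unchecked(s, i);
}

// Option 3: check the whole range eagerly; afterwards decode_unchecked
// can be used on it without further checks.
inline bool verify_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = byte_at(s, i);
        const std::size_t n = utf8_length(lead);
        if (n == 0 || n > s.size() - i) return false;
        for (std::size_t k = 1; k < n; ++k)
            if ((byte_at(s, i + k) & 0xC0) != 0x80) return false;
        if (n >= 3) {
            const unsigned char b1 = byte_at(s, i + 1);
            if (lead == 0xE0 && b1 < 0xA0) return false;  // overlong 3-byte form
            if (lead == 0xED && b1 > 0x9F) return false;  // encoded surrogate
            if (lead == 0xF0 && b1 < 0x90) return false;  // overlong 4-byte form
            if (lead == 0xF4 && b1 > 0x8F) return false;  // above U+10FFFF
        }
        i += n;
    }
    return true;
}
```

With this split, "option 2 as the debug mode of option 1" would amount to turning the throws in decode_checked into the asserts already present in decode_unchecked.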

Are all three options good to have? Should option 2 do just the checks
it needs or should it assert the whole invariant?
Or should option 2 just be the behaviour of option 1 in debug mode?

