Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2008-11-20 09:08:57


> Eric Niebler wrote:
>>
>> Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode
>> library comes up, the discussion immediately descends into a debate about
>> how to design yet another string class. Such a high level wrapper *might* be
>> useful (strong emphasis on "might"), but the core must be the Unicode
>> algorithms, and the design for a Unicode library must start there.
>
> Since it seems like there's a lot of concern with making a new string type,
> how about the following (off-the-cuff):
>
> * Iterator filters a la Zach's message:

[snip]

> * Runtime-defined filters:
>
> typedef boost::recoding_iterator<boost::utf16,boost::runtime>
> utf16_to_any_iter;
> boost::runtime *my_codec = /*...*/;
> std::copy(utf16_to_any_iter(u_string.begin(), my_codec),
> utf16_to_any_iter(u_string.end(), my_codec),
> std::back_inserter(std_string));

Yes, that's what I was thinking as well. In fact, if you look at the
Boost.GIL any_image<> and any_image_view<> templates, you'll see that
they allow the user to specify a limited number of variants (a la
Boost.Variant). So it's more restrictive than a Boost.Any, but that
might be an advantage if it allows you to detect more errors at
runtime. I think that in most use cases, one will know the maximum
number of encodings that are possible in that case. Just something to
consider.
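
To make the bounded-variant idea concrete, here is a rough sketch. It
uses std::variant (C++17) as a stand-in for Boost.Variant, and the tag
types and the code_unit_size() function are names made up for this
example, not part of any proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <variant>

// Hypothetical encoding tags; each records the size of one code unit in bytes.
struct utf8_tag  { static constexpr std::size_t unit_size = 1; };
struct utf16_tag { static constexpr std::size_t unit_size = 2; };
struct utf32_tag { static constexpr std::size_t unit_size = 4; };

// The user lists the encodings that can occur at this point in the program.
// Here only UTF-8 and UTF-16 are allowed; utf32_tag is deliberately left out,
// so `any_encoding e = utf32_tag{};` would not compile.
using any_encoding = std::variant<utf8_tag, utf16_tag>;

// The encoding is chosen at runtime, but the visitor only has to cover the
// closed set of alternatives listed above.
inline std::size_t code_unit_size(const any_encoding& e)
{
    return std::visit([](auto tag) { return decltype(tag)::unit_size; }, e);
}
```

With Boost.Any the same function would need a chain of casts and a
fallback for "some type I didn't expect"; with the closed list there is
no such case to handle.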

> * Shorthand for the above two points:
>
> boost::transcode(u_string, boost::utf16(),
> std_string, boost::utf8());
>

Looks good, but is this function an assignment, or an append?
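
To show why the distinction matters, here is a sketch of both
behaviours for the UTF-16 to UTF-8 case. The names transcode_append()
and transcode_assign() are hypothetical, the input is assumed to be
well-formed UTF-16, and std::u16string (C++11) stands in for whatever
container the real interface would accept.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical: appends the UTF-8 encoding of `in` to whatever `out` holds.
// Assumes well-formed UTF-16 (every high surrogate is followed by a low one).
inline void transcode_append(const std::u16string& in, std::string& out)
{
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t low = in[i + 1];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;
        }
        // Emit one to four UTF-8 code units depending on the code point value.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
}

// Hypothetical: replaces the contents of `out` with the UTF-8 encoding of `in`.
inline void transcode_assign(const std::u16string& in, std::string& out)
{
    out.clear();
    transcode_append(in, out);
}
```

Assignment is trivially built on append, but not the other way around,
which is one argument for making append the primitive.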

> * String views that can wrap up the encoding type and the data (a container
> of some kind: strings, vector<char>s, ropes, etc):
>
> boost::estring_view<utf8> my_utf8_string(std_string);
> boost::estring_view<> my_rt_string(str, my_codec);
>
> boost::transcode(my_utf8_string, my_rt_string);

Yes. Views are notably absent in my original post. I think views are
essential for encodings that are variable in length (e.g. UTF-8).
Getting the character-location of code point N, or vice versa, and
doing it efficiently, is a must-have.
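
For reference, the naive version of that lookup over raw UTF-8 looks
something like the sketch below. Both function names are invented for
this example and the input is assumed to be well-formed UTF-8. Each
call is a linear scan, which is exactly the cost a view would want to
reduce, for instance by caching offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A byte of the form 10xxxxxx continues a code point; any other byte starts one.
inline bool is_lead_byte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
}

// Hypothetical: byte offset at which code point number `n` (0-based) starts,
// or std::string::npos if the string has fewer than n + 1 code points.
inline std::size_t byte_offset_of_code_point(const std::string& s, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_lead_byte(s[i])) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}

// Hypothetical: 0-based index of the code point that contains byte `offset`.
// Requires offset < s.size().
inline std::size_t code_point_index_of_byte(const std::string& s, std::size_t offset)
{
    assert(offset < s.size());
    std::size_t leads = 0;
    for (std::size_t i = 0; i <= offset; ++i) {
        if (is_lead_byte(s[i])) ++leads;
    }
    // Well-formed UTF-8 starts with a lead byte, so leads >= 1 here.
    return leads - 1;
}
```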

> Luckily, most of the work I've done is in making the encoding facets
> extensible and chooseable at runtime, so I wouldn't mourn the loss of my
> (frankly none-too-zazzy) string class.

This is just what I was hoping. The bulk of the work you'll do in any
case will probably be with the algorithms and number of supported
encodings.

Zach

