Subject: Re: [boost] Boost.Locale (was Re: [SQL-Connectivity] Is Boost interested in CppDB?)
From: Artyom (artyomtnk_at_[hidden])
Date: 2010-12-15 12:42:25
> From: Matus Chochlik <chochlik_at_[hidden]>
> On Tue, Dec 14, 2010 at 8:25 PM, Mathias Gaunard
> > My library can do that kind of conversion with arbitrary ranges, and
> > possibly lazily as it is being iterated.
> >
> > Artyom's library can probably do it too, but only eagerly and with
> > contiguous memory segments.
> >
>
> + Eager / lazy iteration and traversing noncontiguous sequences
> are cool
>
> [...]
>
> Another thing is some kind of adaptor for std::(w)string providing
> begin()/end() functions returning an iterator traversing through the
> code points instead of utf-XY "chars". i.e. in C++0x:
>
> std::string s = get_utf8_string();
> auto as = adapt(s);
> auto i = as.begin(), e = as.end();
> while(i != e)
> {
> char32_t c = *i;
> ...
> *i = transform(c);
> ++i;
> }
>
That is exactly the reason Boost.Locale does not provide iteration
over code points...
What kind of transform(c) do you want to do?
See... Usually code points are meaningless in the context of
natural text processing; you generally need higher-level units.
Examples:
1. How many characters are there in "שָלוֹם"? There are 4 characters and
6 code points (4 base letters + 2 diacritics). Code point != character,
and this is why you do not need "indexing" over code points unless you
are developing some Unicode algorithm.
2. You rarely work with (transform) standalone code points.
You always use context; even something like case conversion
may change the number of code points in the string!
If you want to split the text into characters, words, etc., there is a
break iterator that does this for you.
Artyom
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk