Subject: Re: [boost] Boost.Locale (was Re: [SQL-Connectivity] Is Boost interested in CppDB?)
From: Artyom (artyomtnk_at_[hidden])
Date: 2010-12-15 12:42:25


> From: Matus Chochlik <chochlik_at_[hidden]>
> On Tue, Dec 14, 2010 at 8:25 PM, Mathias Gaunard
> > My library can do that kind of conversion with arbitrary ranges, and
> > possibly lazily as it is being iterated.
> >
> > Artyom's library can probably do it too, but only eagerly and with
> > contiguous memory segments.
> >
>
> + Eager / lazy iteration and traversing noncontiguous sequences
> are cool
>
> [...]
>
> Another thing is some kind of adaptor for std::(w)string providing
> begin()/end() functions returning an iterator traversing through the
> code points instead of utf-XY "chars". i.e. in C++0x:
>
> std::string s = get_utf8_string();
> auto as = adapt(s);
> auto i = as.begin(), e = as.end();
> while(i != e)
> {
>     char32_t c = *i;
>     ...
>     *i = transform(c);
>     ++i;
> }
>

That is exactly the reason Boost.Locale does not provide iteration
over code points.

What kind of transform(c) do you want to do?

You see, code points are usually meaningless in the context of natural
text processing; you generally need higher-level units.

Examples:

1. How many characters are there in "שָלוֹם"? There are 4 characters
   and 6 code points (4 base letters + 2 diacritics).

   Code point != character, and this is why you do not need "indexing"
   over code points unless you are developing some Unicode algorithm.

2. You rarely work with (transform) standalone code points. You always
   use context; even something like case conversion may change the
   number of code points in the string!

If you want to split the text into characters, words, etc., there is
a break iterator that does this for you.

Artyom
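
P.S. The two examples above can be checked with a minimal sketch in plain standard C++11. None of this is Boost.Locale API: decode_utf8, is_hebrew_mark, count_without_marks and toy_upper are hypothetical helpers written only for this illustration. The decoder assumes well-formed UTF-8, the "character" count merely skips Hebrew combining marks in U+0591..U+05BD, and the case mapping knows only ASCII plus the German sharp s. Real segmentation and case conversion need the full Unicode rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 into code points. Assumption: the input is well formed;
// no validation is done here.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        // Append the 6 payload bits of every continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Toy approximation: Hebrew accents and points (U+0591..U+05BD) are
// combining marks that attach to the preceding base letter.
bool is_hebrew_mark(char32_t cp) {
    return cp >= 0x0591 && cp <= 0x05BD;
}

// Count code points that are not (Hebrew) combining marks.
std::size_t count_without_marks(const std::u32string& cps) {
    std::size_t n = 0;
    for (char32_t cp : cps)
        if (!is_hebrew_mark(cp)) ++n;
    return n;
}

// Toy uppercase: ASCII letters plus U+00DF (sharp s), which becomes "SS".
std::u32string toy_upper(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp >= U'a' && cp <= U'z')
            out.push_back(static_cast<char32_t>(cp - (U'a' - U'A')));
        else if (cp == 0x00DF)
            out += U"SS";           // one code point in, two out
        else
            out.push_back(cp);
    }
    return out;
}

void demo() {
    // The word from example 1, written as explicit UTF-8 bytes.
    const std::string shalom =
        "\xD7\xA9\xD6\xB8\xD7\x9C\xD7\x95\xD6\xB9\xD7\x9D";
    const std::u32string cps = decode_utf8(shalom);
    assert(shalom.size() == 12);            // UTF-8 "chars"
    assert(cps.size() == 6);                // code points
    assert(count_without_marks(cps) == 4);  // base letters

    // Example 2: case conversion changes the number of code points.
    const std::u32string lower = U"stra\u00DFe";
    assert(lower.size() == 6);
    assert(toy_upper(lower).size() == 7);
}
```

Calling demo() passes all the assertions: 12 bytes, 6 code points and 4 base letters for the Hebrew word, and 6 code points growing to 7 after uppercasing. A per-code-point transform(c) that writes back through the iterator cannot express the second case at all.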

