From: Eric Niebler (eric_at_[hidden])
Date: 2004-10-20 15:37:19


Peter Dimov wrote:
> Eric Niebler wrote:
>
>> Erik Wien wrote:
>>
>>> The iterators used are bidirectional, not random access (impossible
>>> on UTF-8 and UTF-16)
>>
>> No. Andrei Alexandrescu explained a scheme to me whereby a UTF-16
>> encoded string can have a random-access iterator, and I think it
>> should. The basic idea is you keep a plain array of 16-bit integers
>> which are the 16-bit characters and the first 16 bits of surrogate
>> pairs. Then you have a data structure which maps from string offsets
>> to the second 16 bits of surrogate pairs. Random access involves a
>> simple index and a map look-up. Sequential access requires no map
>> look-up. And since surrogate pairs are very rare, the map will almost
>> always be empty and the look-up is skipped.
>
> Nice! But this seems to make c_str an O(N) operation. If I need to speak
> to a library in the common extern "C" language of interoperability, and
> that library happens to need UTF-16 encoded wchar_t const [], which by
> coincidence has the same representation as char16_t const [], I won't be
> very happy if the C++ string seems to ignore this common scenario.

Two points. First, keep in mind that surrogates are exceedingly rare.
The common case is that there are no surrogates, and c_str is O(1).
Second, in the rare case where there are surrogates, there can be a
mutable cache that c_str can return, building it on demand only when the
cache is dirty.
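
A minimal sketch of the scheme discussed above, under stated assumptions:
the class name sparse_utf16_string and all of its members are hypothetical,
not part of any Boost or standard library, and the input is assumed to be
valid Unicode scalar values. It keeps one 16-bit slot per code point, stores
the second half of each surrogate pair in a side map, and builds the
contiguous buffer for c_str lazily.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical illustration only; names and interface are assumptions.
class sparse_utf16_string {
public:
    // Assumes every element of s is a valid Unicode scalar value.
    explicit sparse_utf16_string(const std::u32string& s) {
        for (char32_t cp : s) push_back(cp);
    }

    void push_back(char32_t cp) {
        if (cp < 0x10000) {
            units_.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            // The second 16 bits go into the side map, keyed by offset.
            trail_[units_.size()] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
            // The first 16 bits stay in the main array.
            units_.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        }
        dirty_ = true;
    }

    // Number of code points: a surrogate pair occupies a single slot.
    std::size_t size() const { return units_.size(); }

    // Random access: one index, plus a map look-up only for surrogates.
    char32_t at(std::size_t i) const {
        assert(i < units_.size());
        char16_t u = units_[i];
        if (trail_.empty() || u < 0xD800 || u > 0xDBFF) return u;
        auto it = trail_.find(i);
        assert(it != trail_.end());
        return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (it->second - 0xDC00);
    }

    // O(1) when there are no surrogates; otherwise rebuilds the mutable
    // cache only when it is dirty.
    const char16_t* c_str() const {
        if (trail_.empty()) return units_.c_str();
        if (dirty_) {
            cache_.clear();
            cache_.reserve(units_.size() + trail_.size());
            auto next = trail_.begin();  // std::map iterates in key order
            for (std::size_t i = 0; i < units_.size(); ++i) {
                cache_.push_back(units_[i]);
                if (next != trail_.end() && next->first == i) {
                    cache_.push_back(next->second);
                    ++next;
                }
            }
            dirty_ = false;
        }
        return cache_.c_str();
    }

private:
    std::u16string units_;                   // BMP units and lead surrogates
    std::map<std::size_t, char16_t> trail_;  // offset -> trail surrogate
    mutable std::u16string cache_;           // contiguous UTF-16 for c_str
    mutable bool dirty_ = true;
};
```

In this sketch the side map costs nothing for text without surrogates, and
the cache doubles the storage only for strings that contain surrogates and
are actually passed to a C interface.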

IMO the advantages of having a random access iterator are worth the
trouble, especially considering how rare surrogates are.

Oh, and I agree that it should be a const iterator. :-)

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com
