From: Eric Niebler (eric_at_[hidden])
Date: 2004-10-20 14:48:31


Erik Wien wrote:
> The iterators used are bidirectional, not random access (impossible on UTF-8
> and UTF-16)

No. Andrei Alexandrescu explained a scheme to me whereby a UTF-16
encoded string can have a random-access iterator, and I think it should.
The basic idea is that you keep a plain array of 16-bit integers, one
per character: either the 16-bit character itself or the first 16 bits
of a surrogate pair. Then you have a data structure which maps from
string offsets to the second 16
bits of surrogate pairs. Random access involves a simple index and a map
look-up. Sequential access requires no map look-up. And since surrogate
pairs are very rare, the map will almost always be empty and the look-up
is skipped.
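
A minimal sketch of how such a layout might look, as I read the
description above. The class name Utf16IndexedString, its members, and
the choice of std::map for the side table are all assumptions made for
illustration, not Andrei's actual design, and the input is assumed to be
valid code points. It shows only indexed access; a sequential iterator
could presumably avoid the map by keeping a cursor into a sorted side
table.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: one 16-bit slot per code point, plus a side table
// holding the second (low) surrogate of each supplementary character.
class Utf16IndexedString {
public:
    explicit Utf16IndexedString(const std::u32string& code_points) {
        units_.reserve(code_points.size());
        for (char32_t cp : code_points) {
            if (cp >= 0x10000) {
                // Split into a surrogate pair; keep the high half in the
                // array and file the low half under the same offset.
                const char32_t v = cp - 0x10000;
                low_[units_.size()] =
                    static_cast<char16_t>(0xDC00 + (v & 0x3FF));
                units_.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            } else {
                units_.push_back(static_cast<char16_t>(cp));
            }
        }
    }

    // Number of code points, which equals the number of array slots.
    std::size_t size() const { return units_.size(); }

    // Random access: one array index, plus a map look-up only when the
    // side table is non-empty and the slot holds a high surrogate.
    char32_t operator[](std::size_t i) const {
        assert(i < units_.size());
        const char16_t u = units_[i];
        if (low_.empty() || u < 0xD800 || u > 0xDBFF) return u;
        const auto it = low_.find(i);
        if (it == low_.end()) return u;  // unpaired surrogate, returned as-is
        return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                       + (char32_t(it->second) - 0xDC00);
    }

private:
    std::vector<char16_t> units_;          // one slot per code point
    std::map<std::size_t, char16_t> low_;  // offset -> low surrogate
};
```

With a std::map the look-up is logarithmic in the number of surrogate
pairs rather than constant, so whether this counts as "random access" in
the strict iterator sense depends on the side table chosen.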

I think the default should be UTF-16 encoding, and that the iterator
should use a scheme like this to be random access. Rationale: there are
string algorithms that benefit from random access (Boyer-Moore comes to
mind).
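
To illustrate why random access matters here, below is a rough sketch of
the Horspool simplification of Boyer-Moore (bad-character rule only, not
the full algorithm). The function name horspool_find is made up for this
example, and std::u32string merely stands in for any string whose
characters can be indexed directly. The point is the jump
"pos += shift", which skips several characters at once and is only cheap
when indexing is cheap.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: index of the first occurrence of `needle` in
// `text`, or std::u32string::npos if there is none.
std::size_t horspool_find(const std::u32string& text,
                          const std::u32string& needle) {
    const std::size_t n = text.size();
    const std::size_t m = needle.size();
    if (m == 0) return 0;
    if (m > n) return std::u32string::npos;

    // Shift table: distance from the last occurrence of each character
    // (excluding the final needle position) to the end of the needle.
    std::unordered_map<char32_t, std::size_t> shift;
    for (std::size_t k = 0; k + 1 < m; ++k) shift[needle[k]] = m - 1 - k;

    std::size_t pos = 0;
    while (pos + m <= n) {
        // Compare right to left; this indexes into the middle of the text.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == needle[j - 1]) --j;
        if (j == 0) return pos;

        // Jump ahead by the shift for the character under the needle's
        // last position; characters not in the needle allow a full skip.
        const auto it = shift.find(text[pos + m - 1]);
        pos += (it == shift.end()) ? m : it->second;
    }
    return std::u32string::npos;
}
```

With only a bidirectional iterator, each jump would have to step through
the skipped characters one by one, which gives up much of the benefit.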

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com
