From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2004-10-21 10:06:27


On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler
<eric_at_[hidden]> wrote:
> I think the default should be UTF-16 encoding, and that the iterator
> should use a scheme like this to be random access. Rationale: there are
> string algorithms that benefit from random access (Boyer-Moore comes to
> mind).

Correct me if I'm wrong, but from what I gather from a Google search,
Boyer-Moore is a fast string search algorithm. Why not run the
algorithm on code units rather than code points? Neither UTF-8 nor
UTF-16 is stateful, precisely so that optimisations such as this (as
well as error recovery) are possible.
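
A minimal sketch of searching on code units, assuming both strings hold
valid UTF-8 in a std::string. The function name find_utf8 is
hypothetical, and std::boyer_moore_searcher is a C++17 facility that did
not exist when this thread was written; it is used here only to
illustrate the idea.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: returns the byte offset of the first occurrence of
// `needle` in `haystack`, or std::string::npos if there is none.
// Assumption: both arguments contain valid UTF-8.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) {
        return 0;  // an empty pattern matches at the start
    }
    // Boyer-Moore runs over raw bytes (code units); no decoding is needed.
    auto it = std::search(
        haystack.begin(), haystack.end(),
        std::boyer_moore_searcher(needle.begin(), needle.end()));
    if (it == haystack.end()) {
        return std::string::npos;
    }
    return static_cast<std::size_t>(it - haystack.begin());
}
```

Because UTF-8 lead bytes and continuation bytes never overlap, a valid
needle can only match at a code point boundary, so the byte offset
returned always points at the start of a code point.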

As was pointed out earlier in this thread, searching for Unicode
characters requires looking at combining characters as well. I think
this holds for many, if not all, algorithms that you can think of:
either they can be made to work on code units, or they must work on
abstract characters, which means dealing with variable width anyway.
(See the Unicode Standard 4, Section 2.5 for a similar argument for
UTF-16 over UTF-32, even though the latter is fixed-width.)
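
The combining-character point can be sketched as follows. This is a
deliberately crude illustration, not a real grapheme-aware search: the
helper names are hypothetical, and only the Combining Diacritical Marks
block (U+0300..U+036F) is treated as combining, which is a simplifying
assumption.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F counts as a combining mark.
bool is_combining_mark(char16_t u) {
    return u >= 0x0300 && u <= 0x036F;
}

// Hypothetical helper: finds `needle` in `text` (both UTF-16 code units),
// skipping matches that are immediately followed by a combining mark,
// since such a match would cut an abstract character in two.
std::size_t find_not_followed_by_mark(const std::u16string& text,
                                      const std::u16string& needle) {
    std::size_t pos = text.find(needle);
    while (pos != std::u16string::npos) {
        std::size_t end = pos + needle.size();
        if (end == text.size() || !is_combining_mark(text[end])) {
            return pos;  // match ends on an abstract character boundary
        }
        pos = text.find(needle, pos + 1);  // try the next candidate
    }
    return std::u16string::npos;
}
```

For the text u"re\u0301sume", a plain code unit search for u"e" matches
at index 1, in the middle of the accented character, while the helper
above returns index 6. Even with random access to code units, the
search still has to inspect a variable number of units per character.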

I'm ready to be proven wrong; however, at this moment at least I
believe that any effort to make UTF-16 randomly accessible is not
useful.

Regards,
Rogier

