Boost logo

Boost :

From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2004-10-21 15:57:37


On Thu, 21 Oct 2004 12:10:28 -0700, Eric Niebler
<eric_at_[hidden]> wrote:
>
> Rogier van Dalen wrote:
> > On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler
> > <eric_at_[hidden]> wrote:
> >
> >>I think the default should be UTF-16 encoding, and that the iterator
> >>should use a scheme like this to be random access. Rationale: there are
> >>string algorithms that benefit from random access (Boyer-Moore comes to
> >>mind).
> >
> >
> > Correct me if I'm wrong. From what I gather from a Google search,
> > Boyer-Moore is a fast string search algorithm. Why not use the
> > algorithm on the code units rather than codepoints? UTF-8 and UTF-16
> > are both not stateful, specifically to allow optimisations such as
> > this (as well as error recovery).
> >
>
>
> Searching a Unicode string for a particular bit pattern is not
> particularly meaningful because the same string can be represented with
> different bit patterns. Have I misinterpreted what you are suggesting?

If the strings are not normalised, that is correct.
However, if the strings are normalised (to the same form), then the
codepoint patter will be the same; and so will the code unit pattern
(through the way UTF-8 and UTF-16 are specified). If you search a
Unicode string for a substring using code units, you *will* find all
matches, plus some bogus matches. For example, if you're looking for
"can" you will match "cañ" too, in normalisation form D. To get rid of
these bogus matches, all you have to do is check that no combining
marks are following, and otherwise, proceed to the next match.
So Boyer-Moore may be used on the code units, as long as it is checked
that a match includes full abstract characters only.

Hope this makes clearer what I'm trying to say.
Rogier


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk