
Boost Users :

From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2019-11-02 18:54:34


On Fri, Nov 1, 2019 at 4:22 PM Zach Laine <whatwasthataddress_at_[hidden]>
wrote:

> On Fri, Nov 1, 2019 at 3:41 PM Mathias Gaunard <
> mathias.gaunard_at_[hidden]> wrote:
>
>> To search for the utf-8 substring "foo" in the utf-8 string "I really
>> like foo dogs", there is no need to iterate the string per code point
>> or per grapheme as you do in your examples. You can just perform the
>> search at the code unit level, then check that the position before and
>> after the match does not lie inside a grapheme cluster, i.e. they are
>> on a valid boundary.
>> What you need to be able to do that is a function that tells you
>> whether an arbitrary position in your sequence of utf-8 code units
>> lies at a grapheme cluster boundary or not (which would probably be a
>> composition of two separate functions, one that tests whether the code
>> unit is on a code point boundary, and one that tests whether the code
>> point is on a grapheme cluster boundary). This functionality is not
>> provided.
>>
>> This sort of thing is briefly touched upon in Unicode TR#29 6.4.
>>
>
> I see. This seems like it might be really useful to add. I'll open a
> ticket for it on Github.
>

After writing this, I realized this is already supported: the position it
lies on a grapheme boundary when prev_grapheme_break(first, it, last) == it.
There is an exception to this, though, when it == last. I should either
remove that exception (which sounds like the right answer regardless of the
rest), or provide at_grapheme_break(first, it, last) (probably a good thing
to do regardless of the rest).
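
A minimal sketch of the approach Mathias describes: search at the code unit
level, then accept a match only if both of its ends sit on a boundary. The
names find_on_boundaries, at_code_point_boundary and toy_grapheme_boundary
are hypothetical and not part of any library. The toy predicate only knows
about combining marks in U+0300..U+036F and is a stand-in for a real
grapheme test such as prev_grapheme_break(first, it, last) == it. The input
is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// True if byte offset i does not fall inside a multi-byte UTF-8 sequence,
// i.e. the byte at i is not a continuation byte (10xxxxxx).
inline bool at_code_point_boundary(std::string_view s, std::size_t i) {
    return i >= s.size() ||
           (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
}

// Toy stand-in for a grapheme boundary test (hypothetical, NOT full UAX #29):
// a position is rejected only if a combining mark U+0300..U+036F starts there.
// Start and end of text always count as boundaries.
inline bool toy_grapheme_boundary(std::string_view s, std::size_t i) {
    if (i == 0 || i >= s.size()) return true;
    if (!at_code_point_boundary(s, i)) return false;
    if (i + 1 < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) return false; // U+0300..U+033F
        if (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF) return false; // U+0340..U+036F
    }
    return true;
}

// Code-unit-level search; a match is kept only if the positions just before
// and just after it pass the caller-supplied boundary predicate.
// Returns the byte offset of the first accepted match, or npos.
template <typename BoundaryPred>
std::size_t find_on_boundaries(std::string_view haystack,
                               std::string_view needle,
                               BoundaryPred at_boundary) {
    if (needle.empty()) return std::string_view::npos;
    std::size_t pos = 0;
    while ((pos = haystack.find(needle, pos)) != std::string_view::npos) {
        std::size_t end = pos + needle.size();
        if (at_boundary(haystack, pos) && at_boundary(haystack, end))
            return pos;
        ++pos; // rejected: keep searching past this match start
    }
    return std::string_view::npos;
}
```

With the toy predicate, searching "I really like foo dogs" for "foo" finds
the match at byte offset 14, while a "foo" that is immediately followed by a
combining acute accent (bytes 0xCC 0x81) is skipped, because the match would
end in the middle of a grapheme cluster.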

Zach


