On Fri, Nov 1, 2019 at 4:22 PM Zach Laine <whatwasthataddress@gmail.com> wrote:
On Fri, Nov 1, 2019 at 3:41 PM Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
To search for the utf-8 substring "foo" in the utf-8 string "I really
like foo dogs", there is no need to iterate the string per code point
or per grapheme as you do in your examples. You can just perform the
search at the code unit level, then check that the start and end of
the match do not lie inside a grapheme cluster, i.e. that they are
on valid boundaries.
What you need to be able to do that is a function that tells you
whether an arbitrary position in your sequence of utf-8 code units
lies at a grapheme cluster boundary or not (which would probably be a
composition of two separate functions, one that tests whether the code
unit is on a code point boundary, and one that tests whether the code
point is on a grapheme cluster boundary). This functionality is not
provided.
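
For illustration, here is a rough sketch of that approach, not anything taken from the library. All three function names are made up for this example. The code point test is the standard UTF-8 continuation-byte check; the grapheme test is a deliberately crude stand-in that only knows about the combining marks U+0300..U+036F, so a real version would need the full TR#29 rules.

```cpp
#include <cstddef>
#include <string>

// True if pos is a code point boundary in the UTF-8 string s: either end of
// the string, or a byte that is not a continuation byte (10xxxxxx).
inline bool at_code_point_boundary(const std::string& s, std::size_t pos) {
    if (pos == 0 || pos >= s.size()) return true;
    return (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// ASSUMPTION: toy grapheme rule, for illustration only. A position is treated
// as inside a cluster only if the code point starting there is a combining
// mark in U+0300..U+036F (UTF-8 bytes 0xCC 0x80 through 0xCD 0xAF).
inline bool at_grapheme_boundary_sketch(const std::string& s, std::size_t pos) {
    if (pos == 0 || pos >= s.size()) return true;
    if (!at_code_point_boundary(s, pos)) return false;
    if (pos + 1 >= s.size()) return true;  // too short to hold a 2-byte mark
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    const unsigned char b1 = static_cast<unsigned char>(s[pos + 1]);
    const bool combining = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
    return !combining;
}

// Search at the code unit level, then accept a match only if both its start
// and its end pass the boundary test. Returns std::string::npos if none does.
inline std::size_t find_on_grapheme_boundaries(const std::string& haystack,
                                               const std::string& needle) {
    std::size_t from = 0;
    for (;;) {
        const std::size_t pos = haystack.find(needle, from);
        if (pos == std::string::npos) return pos;
        if (at_grapheme_boundary_sketch(haystack, pos) &&
            at_grapheme_boundary_sketch(haystack, pos + needle.size()))
            return pos;
        from = pos + 1;  // rejected match: keep scanning code units
    }
}
```

With this, searching "I really like foo dogs" for "foo" never decodes a code point, while a "foo" immediately followed by U+0301 is rejected because the match would end inside a cluster.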

This sort of thing is briefly touched upon in Unicode TR#29, section 6.4 (Random Access).

I see.  This seems like it might be really useful to add.  I'll open a ticket for it on GitHub.

After writing this, I realized this is already supported: the check is prev_grapheme_break(first, it, last) == it.  There is an exception to this, though, when it == last.  I should either remove that exception (which sounds like the right answer regardless of the rest), or provide at_grapheme_break(first, it, last) (probably a good thing to do regardless of the rest).
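
A rough sketch of what such a wrapper might look like, written generically so it does not depend on the real implementation. The names at_break_sketch and toy_prev_break are hypothetical. The toy function is not the library's prev_grapheme_break; it only knows the combining marks U+0300..U+036F, and it assumes that the exception mentioned above means the call steps back from last instead of returning it.

```cpp
// Hypothetical wrapper: prev_break is any callable shaped like
// prev_break(first, it, last) that returns the nearest break at or before it.
template <typename Iter, typename PrevBreak>
bool at_break_sketch(Iter first, Iter it, Iter last, PrevBreak prev_break) {
    if (it == last) return true;  // the end of the sequence is always a break
    return prev_break(first, it, last) == it;
}

// Toy stand-in over code points, NOT the library function. ASSUMPTION: when
// it == last it steps back one code point first, since last cannot be
// dereferenced; this imitates the assumed it == last exception.
inline const char32_t* toy_prev_break(const char32_t* first,
                                      const char32_t* it,
                                      const char32_t* last) {
    if (it == first) return it;
    if (it == last) --it;
    // Walk back over combining marks to the start of the cluster.
    while (it != first && *it >= 0x0300 && *it <= 0x036F) --it;
    return it;
}
```

For the three code points 'e', U+0301, 'x', at_break_sketch reports breaks at offsets 0, 2 and 3 but not at 1, even though toy_prev_break called with it == last returns offset 2.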

Zach