
Boost Users :

From: Rainer Deyke (rainerd_at_[hidden])
Date: 2019-10-30 20:02:36


On 30.10.19 16:56, Zach Laine via Boost-users wrote:
> On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users <
> boost-users_at_[hidden]> wrote:
>> In summary, I think you should support NFD text types. Either in
>> addition to FCC or instead of it.
>
> NFD is not an unreasonable choice, though I don't know why you'd want to do
> a search-replace that changes all the accents from acute to grave (is that
> a real use-case, or just a for-instance?).

The specific example is just hypothetical, but wanting to operate on
diacritics and base characters separately is real enough. Better
examples: checking that Chinese pinyin syllables have their tone markers
on the correct vowel. Or collecting statistics on the use of diacritics
in a text. Or testing if a font has all of the glyphs needed to render
a text. Or replacing a diacritic that's on my keyboard layout with
another one that's not. Or even just collation.
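
To make the statistics case concrete, here is a minimal sketch in Python rather than in the library under discussion. It assumes the standard unicodedata module, and the function name diacritic_counts is made up for illustration. It also treats "diacritic" as "code point with a non-zero canonical combining class", which is a simplification. Once the text is in NFD, every such mark is its own code point and can be counted directly.

```python
import unicodedata
from collections import Counter

def diacritic_counts(text):
    """Count combining marks in text after normalizing it to NFD.

    Hypothetical helper: a "diacritic" here is any code point whose
    canonical combining class is non-zero (a simplifying assumption).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(
        unicodedata.name(ch, hex(ord(ch)))  # fall back to hex if unnamed
        for ch in decomposed
        if unicodedata.combining(ch) != 0
    )

# "café résumé à la carte", written with escapes to avoid encoding issues.
counts = diacritic_counts("caf\u00e9 r\u00e9sum\u00e9 \u00e0 la carte")
for name, n in counts.most_common():
    print(name, n)
```

With precomposed input, the same loop without the normalization step would count nothing, which is the point of wanting decomposed text for this kind of work.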

> Unfortunately, the fast-path of
> the collation algorithm implementation requires FCC, which is why ICU uses
> it, and one of the main reasons why I picked it. If we had NFD strings,
> we'd have to normalize them to FCC first, if I'm not mistaken. (Though I
> should verify that with a test.)

I find that surprising, since FCC more than any other normalization
form mixes precomposed and decomposed characters. But I will say this
for FCC: at least it's easy to transcode from FCC to NFD. It could even
be done in a fairly straightforward iterator adapter.
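
A rough sketch of what I mean by an iterator adapter, again in Python with the standard unicodedata module and not in terms of any Boost.Text interface. The name nfd_iter is hypothetical. Python has no FCC form, so this adapter does not exploit any property of FCC: it decomposes each incoming character and keeps a small buffer to put runs of combining marks into canonical order, which makes it work on arbitrary input. Whether that buffer could be dropped for input known to be FCC is something I have not verified here.

```python
import unicodedata

def nfd_iter(chars):
    """Lazily yield the NFD form of an iterable of characters.

    Hypothetical adapter: decomposes one input character at a time and
    buffers only the current run of non-starters for reordering.
    """
    pending = []  # combining marks waiting for canonical reordering
    for ch in chars:
        # Full canonical decomposition of a single character.
        for d in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(d) == 0:
                # A starter ends the run: flush marks in stable ccc order.
                pending.sort(key=unicodedata.combining)
                yield from pending
                pending.clear()
                yield d
            else:
                pending.append(d)
    # Flush any marks left at the end of the input.
    pending.sort(key=unicodedata.combining)
    yield from pending

# Mixed precomposed and decomposed input, with marks in both orders.
sample = "\u00e9\u1e69x\u0323\u0307q\u0307\u0323"
print("".join(nfd_iter(sample)) == unicodedata.normalize("NFD", sample))
```

The adapter never holds more than one run of combining marks in memory, so it can sit between a stored string and an algorithm that wants to see base characters and diacritics separately.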

-- 
Rainer Deyke (rainerd_at_[hidden])
