Boost Users:

From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2019-10-30 15:56:17


On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users <
boost-users_at_[hidden]> wrote:

> On 26.10.19 18:41, Zach Laine via Boost-users wrote:
> > NFC, very close to FCC, is more popular, due to its compactness. I picked
> > the normalization form with the most readily available time and space
> > optimizations, and then stuck to just that one -- the alternative is many
> > text types with different normalizations having to interoperate, which
> > sounds like hell.
>
> I can understand that, all other things being equal, the more compact
> form might be preferable. I mean, if you know nothing about Unicode
> normalization forms other than that one is more compact than the other,
> then you might as well pick the more compact one, right?
>
> But all other things are clearly /not/ equal, or you would just use NFC.
> And the difference in compactness between NFC and NFD is completely
> trivial. I challenge you to find any real-world text where the
> difference in size between NFC and NFD is big enough that I should care
> about it, both in absolute and relative terms.
>
> I consider FCC a non-solution to a non-problem. The advantage of NFC
> over NFD is not compactness, but compatibility with interfaces that
> expect NFC. Since FCC does not provide that advantage, there is no
> reason to choose FCC over NFD. On the other hand, there are several
> good reasons for choosing NFD over FCC. Aside from the obvious one -
> compatibility with interfaces that expect NFD - there's also cleaner,
> simpler code with fewer surprises. For example, it is a completely
> straightforward operation to replace all acute accents in an NFD text
> with grave accents or to remove acute accents entirely, whereas the FCC
> equivalent requires effectively transcoding to NFD.
>
> In summary, I think you should support NFD text types. Either in
> addition to FCC or instead of it.
>
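
For concreteness, here is a minimal sketch of the kind of replacement Rainer
describes, using a plain std::u32string rather than any particular text
library's type; the string and the substitution are only illustrative:

    #include <algorithm>
    #include <cassert>
    #include <string>

    int main()
    {
        // "café" in NFD: the accented letter is a plain 'e' followed by
        // U+0301 COMBINING ACUTE ACCENT as a separate code point.
        std::u32string nfd = U"cafe\u0301";

        // Because the accent stands alone in NFD, swapping acute for grave
        // is an ordinary code point substitution; the result is still NFD.
        std::replace(nfd.begin(), nfd.end(),
                     char32_t{0x0301}, char32_t{0x0300});

        assert(nfd == U"cafe\u0300");  // "cafè", still decomposed
    }

Doing the same substitution on FCC (or NFC) text would first require
decomposing any precomposed characters, which is the effective transcoding to
NFD that Rainer mentions.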

NFD is not an unreasonable choice, though I don't know why you'd want to do
a search-and-replace that changes all the accents from acute to grave (is
that a real use case, or just a for-instance?). Unfortunately, the fast path
of the collation algorithm implementation requires FCC, which is why ICU
uses it, and that is one of the main reasons I picked it. If we had NFD
strings, we'd have to normalize them to FCC first, if I'm not mistaken.
(Though I should verify that with a test.)
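
For what it's worth, ICU exposes FCC through its Normalizer2 API (the "nfc"
data used with contiguous composition, UNORM2_COMPOSE_CONTIGUOUS), so
re-normalizing NFD input before collation might look roughly like the sketch
below; this uses ICU directly, not Boost.Text's interface, and the input
string is only an example:

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <string>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;

        // FCC is the "nfc" normalizer run with contiguous composition.
        icu::Normalizer2 const * fcc = icu::Normalizer2::getInstance(
            nullptr, "nfc", UNORM2_COMPOSE_CONTIGUOUS, status);
        if (U_FAILURE(status))
            return 1;

        // "café" in NFD: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        icu::UnicodeString nfd(u"cafe\u0301");

        // Re-normalize to FCC before handing the text to code that expects
        // it, such as a collation fast path.
        icu::UnicodeString fcc_text = fcc->normalize(nfd, status);
        if (U_FAILURE(status))
            return 1;

        std::string utf8;
        fcc_text.toUTF8String(utf8);
        std::cout << utf8 << '\n';  // the composed "café"
    }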

Zach
