Subject: Re: [boost] GSoC Unicode library: second preview
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2009-06-21 21:23:24


Scott McMurray wrote:

> Suppose I have "difficult" with the "ffi" ligature codepoint, and I do
> a perl-style split on /i/.

There is no way for "i" to match as being part of that string unless you
replace the "ffi" ligature by the letters "f", "f", "i".
That operation is known as a compatibility decomposition (and will be
provided by the library in due time, of course, along with compatibility
composition, canonical decomposition, canonical composition and the
normalization forms that are defined in terms of them).
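
As a rough illustration of the point above, here is a minimal sketch in Python. It uses the standard `unicodedata` module as a stand-in, not the GSoC library's interface (which is not shown in this thread), and the variable names are made up for the example. It splits "difficult" written with the U+FB03 "ffi" ligature on "i", before and after compatibility decomposition (NFKD).

```python
import unicodedata

# "difficult" spelled with U+FB03 LATIN SMALL LIGATURE FFI as one code point.
word = "di\uFB03cult"

# Without decomposition, the ligature is opaque: only the standalone "i"
# after the "d" is seen by the split.
raw_parts = word.split("i")

# NFKD applies compatibility decomposition, turning U+FB03 into "f", "f", "i".
decomposed = unicodedata.normalize("NFKD", word)
decomposed_parts = decomposed.split("i")

print(raw_parts)         # two pieces, the second still contains the ligature
print(decomposed_parts)  # three pieces, with plain letters only
```

Note that the middle piece after decomposition is the two letters "f", "f", not the "ff" ligature codepoint: the decomposition does not put a ligature back together.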

You could choose to apply split with arguments normalized according to
normalization form KC, which allows comparison independently of
formatting considerations.
But that also means 5 will match ⁵. You could instead decide that 5 should
match ⁵ but ⁵ should not match 5, in which case the pattern should be in NFC
and the string being searched in NFKC.
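
The asymmetric choice can be sketched as follows, again with Python's standard `unicodedata` module rather than the library under discussion. The helper name `asymmetric_find` is hypothetical and exists only for this example: it normalizes the pattern to NFC and the searched string to NFKC before doing a plain substring test.

```python
import unicodedata

def asymmetric_find(pattern: str, text: str) -> bool:
    """Hypothetical helper: pattern in NFC, searched text in NFKC."""
    needle = unicodedata.normalize("NFC", pattern)
    haystack = unicodedata.normalize("NFKC", text)
    return needle in haystack

# U+2075 is SUPERSCRIPT FIVE; NFKC folds it to "5", NFC leaves it alone.
plain_in_super = asymmetric_find("5", "x\u2075")   # pattern 5, text x⁵
super_in_plain = asymmetric_find("\u2075", "x5")   # pattern ⁵, text x5

print(plain_in_super, super_in_plain)
```

With both arguments in NFKC the match would succeed in both directions; keeping the pattern in NFC is what makes ⁵ fail to match 5.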

> I should probably be getting "d", the "ff"
> ligature codepoint, and "cult". I know if I tried to code that by
> hand in every application I'd miss all kinds of evil corner cases like
> that.

Unfortunately Unicode is made of a lot of corner cases, and there is no
way around them without understanding them.

