Boost Users :

Subject: Re: [Boost-users] [Locale] inconsistent results for utf-8 collation
From: Patrick Ohly (patrick.ohly_at_[hidden])
Date: 2012-08-29 08:39:56


Artyom Beilis <artyomtnk <at> yahoo.com> writes:
> > When comparing the following UTF-8 string pairs using Boost.Locale (any
> > backend) at the "identical" level (accents are relevant) and a UTF-8
> > locale (I tried de_DE.utf-8) on Debian Testing (boost 1.49), I get a
> > result that does not make sense to me.
[...]
> Collations != Lexicographical Comparison.
>
> It is not a mistake that you get
> the same results for all backends: icu, posix and std.
>
> Take even OS C API strcoll you'll see the same behavior (for the reason)

strcoll() indeed reports the same result. However, it is unclear at
which level it operates. It treats "Muller" and "Müller" as different,
so it cannot be comparing at the primary level only.

> The point is that the difference between "B" and "A" is more important
> than the difference between "ü" and "u"
>
> i.e. it first sorts "Muller B" and "Müller A" without accents and then,
> if those are identical, sorts according to the accents.

I was using the "identical" level with the expectation that this would make
differences in accents as relevant as differences between base
characters. I now understand that this is not how the Unicode collation
algorithm works. Thanks for pointing that out.
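
For anyone reading this thread later, here is a toy sketch of the multi-level
idea described above. It is neither the real Unicode collation algorithm nor
Boost.Locale code: the names primary_weight, secondary_weight and
two_level_compare are made up for illustration, and the only accented
character it knows about is "ü" (U+00FC), which is assumed to share its
primary weight with "u".

```cpp
#include <cstddef>
#include <string>

// Hypothetical weight tables: only U+00FC ("ü") is handled, as an
// accented variant of 'u'. A real collator uses full Unicode tables.
static char32_t primary_weight(char32_t c) {
    return c == U'\u00FC' ? U'u' : c;   // base letter, accent ignored
}

static int secondary_weight(char32_t c) {
    return c == U'\u00FC' ? 1 : 0;      // 0 = no accent, 1 = accented
}

// Returns <0, 0 or >0, in the style of strcoll().
int two_level_compare(const std::u32string& a, const std::u32string& b) {
    // Level 1: compare the whole strings by base letters only.
    std::u32string pa, pb;
    for (char32_t c : a) pa += primary_weight(c);
    for (char32_t c : b) pb += primary_weight(c);
    if (int r = pa.compare(pb)) return r;

    // Level 2: reached only when level 1 found no difference, so both
    // strings have the same length here. Accents break the tie.
    for (std::size_t i = 0; i < a.size(); ++i) {
        int d = secondary_weight(a[i]) - secondary_weight(b[i]);
        if (d != 0) return d;
    }
    return 0;
}
```

With this model, two_level_compare(U"Muller B", U"M\u00FCller A") is positive:
the "B" versus "A" difference is found at level 1, before the accent on the
second character is ever looked at. The accent only decides the order when
the base letters are identical, as in "Muller A" versus "Müller A".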

Bye, Patrick

