Subject: [Boost-users] [Locale] inconsistent results for utf-8 collation
From: Patrick Ohly (patrick.ohly_at_[hidden])
Date: 2012-08-29 03:27:53


Hello!

When I compare the following UTF-8 string pairs with Boost.Locale (any
backend) at the "identical" level (where accents are relevant) in a UTF-8
locale (I tried de_DE.utf-8) on Debian Testing (Boost 1.49), I get a
result that does not make sense to me.

"Muller" is considered less than "Müller" (as expected), but "Muller 2"
is considered greater than "Müller 1", even though the names alone
compare the other way round.

Is there a bug in my code, in the underlying libraries, or in my
expectations?

#include <locale.h>

#include <boost/locale.hpp>
#include <boost/assign/std/vector.hpp>
#include <boost/foreach.hpp>
#include <boost/assign/list_of.hpp>
#include <boost/algorithm/string/join.hpp>
#include <boost/tuple/tuple.hpp>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char **argv)
{
    setlocale(LC_ALL, "");

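    // global() returns a copy of the backend manager, so select the backend
    // on that copy and install it again; a select() on the temporary is lost.
    boost::locale::localization_backend_manager mgr =
        boost::locale::localization_backend_manager::global();
    std::cout << "backends: " <<
        boost::join(mgr.get_all_backends(), ", ") << std::endl;
    mgr.select(argc > 2 ? argv[2] : "icu");
    boost::locale::localization_backend_manager::global(mgr);
    std::locale loc = boost::locale::generator()(argc > 1 ? argv[1] : "de_DE.UTF-8");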

    typedef boost::tuple<std::string, std::string> string_pair_t;
    std::vector<string_pair_t> pairs =
        boost::assign::tuple_list_of("Muller", "Müller")
        ("Muller 2", "Müller 1")
        ("Muller B", "Müller A");
    BOOST_FOREACH (const string_pair_t &pair, pairs) {
        const std::string &a = boost::get<0>(pair),
            &b = boost::get<1>(pair);
        int cmp = std::use_facet<boost::locale::collator<char> >(loc).
            compare(boost::locale::collator_base::identical, a, b);
        std::cout <<
            a << " and " << b <<
            " are " <<
            (cmp == 0 ? "identical" : "different") <<
            " (" <<
            (cmp < 0 ? '<' :
                   cmp > 0 ? '>' : '=') <<
            ")" << std::endl;
    }

    return 0;
}

The output on my system:

$ /tmp/mueller de_DE.utf-8 icu
backends: icu, posix, std
Muller and Müller are different (<)
Muller 2 and Müller 1 are different (>)
Muller B and Müller A are different (>)
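
One possible reading of this output, offered as an assumption and not as something confirmed in this message: the collator may compare in several passes, as multi-level collation schemes such as the Unicode Collation Algorithm do. In that model, base letters and digits are compared over the whole string first, and accents only break ties. The sketch below is a toy model of that idea, not Boost.Locale or ICU code. The names primary_key, secondary_key and toy_compare are hypothetical, and the only accented letter it knows is "ü".

```cpp
#include <string>

// Toy model only: the single accented letter handled here is U+00FC
// (u with diaeresis), encoded in UTF-8 as the two bytes 0xC3 0xBC.
static const std::string U_UMLAUT = "\xC3\xBC";

// Primary key: the string with the accent stripped (base characters only).
std::string primary_key(const std::string &s)
{
    std::string key;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s.compare(i, 2, U_UMLAUT) == 0) {
            key += 'u';
            ++i; // skip the second byte of the two-byte sequence
        } else {
            key += s[i];
        }
    }
    return key;
}

// Secondary key: one weight per character, '1' if accented, '0' otherwise.
std::string secondary_key(const std::string &s)
{
    std::string key;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s.compare(i, 2, U_UMLAUT) == 0) {
            key += '1';
            ++i; // skip the second byte of the two-byte sequence
        } else {
            key += '0';
        }
    }
    return key;
}

// Compare all primary weights first; consult accents only on a primary tie.
// Returns -1, 0 or 1.
int toy_compare(const std::string &a, const std::string &b)
{
    int cmp = primary_key(a).compare(primary_key(b));
    if (cmp != 0)
        return cmp < 0 ? -1 : 1;
    cmp = secondary_key(a).compare(secondary_key(b));
    return cmp < 0 ? -1 : (cmp > 0 ? 1 : 0);
}
```

Under this toy model, toy_compare("Muller", "Müller") is negative because the primary keys tie and the accent decides, while toy_compare("Muller 2", "Müller 1") is positive because "2" versus "1" is already a primary difference and the accent is never consulted. That matches the three results above, but whether the ICU backend actually works this way here is the open question.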

Bye, Patrick

