Subject: [boost] [locale] Normalization and transformation
From: Denis Arnaud (denis.arnaud_boost_at_[hidden])
Date: 2012-09-23 07:47:58


The following is a sample program playing with the text conversion features
of Boost.Locale (Boost version 1.48.0 on Linux Fedora 17), as seen in the
documentation:
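 #include <iostream>
 #include <locale>
 #include <string>
 #include <boost/locale.hpp>

 int main() {
  // Get the global localisation backend
  boost::locale::localization_backend_manager locBEMgr =
    boost::locale::localization_backend_manager::global();

  // Select ICU backend as default
  locBEMgr.select ("icu");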

  // Set this backend globally
  boost::locale::localization_backend_manager::global (locBEMgr);

  // Create a generator that uses this backend.
  boost::locale::generator locGen (locBEMgr);

  // Generate the system default locale and set it as the global locale
  std::locale::global (locGen (""));

  // Test string with accents (french word for "side")
  std::string sideFR ("Côté");

  // Test the Boost Locale string conversions
  std::cout << "Original: " << sideFR << std::endl
            <<"Upper " << boost::locale::to_upper (sideFR) << std::endl
            <<"Lower " << boost::locale::to_lower (sideFR) << std::endl
            <<"Title " << boost::locale::to_title (sideFR) << std::endl
            <<"Fold " << boost::locale::fold_case (sideFR) << std::endl
            << "Normalised - [NFD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfd)
            << "; [NFC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfc)
            << "; [NFKD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkd)
            << "; [NFKC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkc)
            << std::endl;

  return 0;
 }

The output is:
Original: Côté
Upper CÔTÉ
Lower côté
Title Côté
Fold côté
Normalised - [NFD]: Côté; [NFC]: Côté; [NFKD]: Côté; [NFKC]: Côté

Apparently, there is no difference between the normalization forms, whatever
method is used (NFC, NFD, NFKC, NFKD). I would expect NFD and NFKD
to produce a different result. But maybe my (UTF8-based) Linux terminal
automatically recomposes the letters and accents, so that we do not see
the difference?
Or is it a feature of Boost.Locale that I overlooked?
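
One way to tell the terminal apart from the library would be to print the raw
bytes of each normalized string instead of the string itself. The following is
only a minimal sketch: toHex is a hypothetical helper of mine, not part of
Boost.Locale, and it assumes nothing beyond the standard library.

```cpp
#include <string>

// Hypothetical helper (not part of Boost.Locale): returns the bytes of a
// string as space-separated, upper-case hexadecimal pairs.
std::string toHex (const std::string& iString) {
  static const char lDigits[] = "0123456789ABCDEF";
  std::string oHex;
  for (std::string::size_type idx = 0; idx != iString.size(); ++idx) {
    const unsigned char lByte = static_cast<unsigned char> (iString[idx]);
    if (idx != 0) {
      oHex += ' ';
    }
    oHex += lDigits[lByte >> 4];    // high nibble
    oHex += lDigits[lByte & 0x0F];  // low nibble
  }
  return oHex;
}
```

In UTF-8, a precomposed "é" (U+00E9) is the two bytes C3 A9, whereas the
decomposed form is "e" followed by the combining acute accent (U+0301), that
is, the three bytes 65 CC 81. Passing the results of boost::locale::normalize
through such a helper should therefore show whether NFD and NFC really differ,
whatever the terminal displays.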

Let me state my goal.

I use Xapian as a full-text matching engine,
and I feed it with texts of various languages and scripts, all in UTF8.
Xapian will typically match keywords with the same forms and cases;
in other words, I cannot choose or influence the collation algorithm,
which corresponds to the third or fourth level in Xapian (AFAIU).

To give a sample, if the "Côté" word has been indexed by Xapian,
"cote" will not match with it.
So, I would like Xapian to index both forms ("Côté" and "cote"),
so that both forms match. When a user gives me a string to match
against the index, I will first try to match the string itself, then
transform it and try to match the transformed version.
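
To make that two-step strategy concrete, here is a rough sketch. It does not
use Xapian at all: TermIndex_T is a hypothetical stand-in (a plain set of
terms) for the real index, and Transform_T stands for whatever transformation
ends up being used; both names, as well as indexTerm and matchesIndex, are
mine.

```cpp
#include <functional>
#include <set>
#include <string>

// Hypothetical stand-in for the Xapian index: a plain set of indexed terms.
typedef std::set<std::string> TermIndex_T;

// Any string transformation (e.g., accent removal).
typedef std::function<std::string (const std::string&)> Transform_T;

// Index both the original term and its transformed version.
void indexTerm (TermIndex_T& ioIndex, const std::string& iTerm,
                const Transform_T& iTransform) {
  ioIndex.insert (iTerm);
  ioIndex.insert (iTransform (iTerm));
}

// Try the query as typed first; if that fails, try its transformed version.
bool matchesIndex (const TermIndex_T& iIndex, const std::string& iQuery,
                   const Transform_T& iTransform) {
  if (iIndex.count (iQuery) != 0) {
    return true;
  }
  return (iIndex.count (iTransform (iQuery)) != 0);
}
```

With an accent-removing transformation, indexing "Côté" would store both
"Côté" and "Cote", so that a query typed with or without the accents finds
the term.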

So, my question is: Is Boost.Locale capable of transforming Unicode
strings, for instance to remove accents?
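
Should there be no ready-made transformation, a fallback I can imagine is to
decompose the string first (with boost::locale::normalize and
boost::locale::norm_nfd) and then drop the combining marks by hand. The sketch
below covers only that second step. stripCombiningMarks is a hypothetical
helper, it assumes valid UTF-8 input already in NFD form, and it handles only
the Combining Diacritical Marks block (U+0300 to U+036F), not every mark in
Unicode.

```cpp
#include <string>

// Hypothetical helper: removes the UTF-8 encoded combining diacritical marks
// (U+0300 to U+036F) from a string assumed to be valid UTF-8 in NFD form.
std::string stripCombiningMarks (const std::string& iNFDString) {
  std::string oString;
  oString.reserve (iNFDString.size());
  std::string::size_type idx = 0;
  while (idx < iNFDString.size()) {
    const unsigned char lLead = static_cast<unsigned char> (iNFDString[idx]);
    if ((lLead == 0xCC || lLead == 0xCD) && idx + 1 < iNFDString.size()) {
      const unsigned char lTrail =
        static_cast<unsigned char> (iNFDString[idx + 1]);
      // CC 80..BF encode U+0300..U+033F; CD 80..AF encode U+0340..U+036F
      const unsigned char lMaxTrail = (lLead == 0xCC) ? 0xBF : 0xAF;
      if (lTrail >= 0x80 && lTrail <= lMaxTrail) {
        idx += 2;  // skip the two bytes of the combining mark
        continue;
      }
    }
    oString += iNFDString[idx];
    ++idx;
  }
  return oString;
}
```

Applied to the NFD form of "Côté" (the bytes of "Co", CC 82, "te", CC 81),
this yields the plain "Cote". A precomposed string passes through unchanged,
which is why the decomposition step has to come first.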

A good example is provided by the first answer to the StackOverflow question.
In other words, I would like to apply the "NFD; [:M:] remove; NFC"
transformation, as is possible with the ICU library. The following is a
code sample showing how to do that:
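// ICU
#include <iostream>
#include <string>
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <unicode/ucnv.h>

int main() {
  // Create a Normalizer: decompose, remove the marks, then recompose
  UErrorCode status = U_ZERO_ERROR;
  const char* lNormaliserID = "NFD; [:M:] Remove; NFC;";
  icu::Transliterator* lNormaliser =
    icu::Transliterator::createInstance (lNormaliserID, UTRANS_FORWARD, status);
  if (U_FAILURE (status)) {
    std::cerr << "Cannot create the transliterator: "
              << u_errorName (status) << std::endl;
    return 1;
  }

  // Register the Transliterator (ICU takes ownership of the instance)
  icu::Transliterator::registerInstance (lNormaliser);

  // Remove the accents from the UTF-8 encoded test string
  icu::UnicodeString myString = icu::UnicodeString::fromUTF8 ("Côté");
  lNormaliser->transliterate (myString);
  std::string lResult;
  myString.toUTF8String (lResult);
  std::cout << "Normalized version without accents: " << lResult << std::endl;

  return 0;
}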

Do not hesitate if you have any suggestions, feedback or workarounds.

Kind regards

