Subject: [boost] [locale] Normalization and transformation
From: Denis Arnaud (denis.arnaud_boost_at_[hidden])
Date: 2012-09-23 07:47:58


Hi,

Below is a sample program that exercises the text conversion features of
Boost.Locale (Boost version 1.48.0 on Linux Fedora 17), as described in
the documentation (http://unicode.org/reports/tr15/#Norm_Forms and
http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/index.html):
___________________________________________________________
 #include <iostream>
 #include <string>
 #include <boost/locale.hpp>

 int main() {
  // Get the global localisation backend
  boost::locale::localization_backend_manager locBEMgr =
    boost::locale::localization_backend_manager::global();

  // Select ICU backend as default
  locBEMgr.select ("icu");

  // Set this backend globally
  boost::locale::localization_backend_manager::global (locBEMgr);

  // Create a generator that uses this backend.
  boost::locale::generator locGen (locBEMgr);

  // Generate the system default locale and install it as the global locale
  std::locale::global (locGen (""));

  // Test string with accents (french word for "side")
  std::string sideFR ("Côté");

  // Test the Boost Locale string conversions
  std::cout << "Original: " << sideFR << std::endl
            <<"Upper " << boost::locale::to_upper (sideFR) << std::endl
            <<"Lower " << boost::locale::to_lower (sideFR) << std::endl
            <<"Title " << boost::locale::to_title (sideFR) << std::endl
            <<"Fold " << boost::locale::fold_case (sideFR) << std::endl
            << "Normalised - [NFD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfd)
            << "; [NFC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfc)
            << "; [NFKD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkd)
            << "; [NFKC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkc)
            << std::endl;

  return 0;
}
___________________________________________________________

The output is:
_____________________________________________________
Original: Côté
Upper CÔTÉ
Lower côté
Title Côté
Fold côté
Normalised - [NFD]: Côté; [NFC]: Côté; [NFKD]: Côté; [NFKC]: Côté
____________________________________________________

Apparently, all four normalization forms (NFC, NFD, NFKC, NFKD) give the
same visible output. I would expect NFD and NFKD to produce a different
result. But maybe my (UTF-8 based) Linux terminal automatically
recomposes the letters and accents when rendering, so that the
difference is not visible?
Or is it a feature of Boost.Locale that I overlooked?

Let me state my goal.

I use Xapian (http://xapian.org/docs/) as a full-text matching engine,
and I feed it texts in various languages and scripts, all in UTF-8.
Xapian typically matches keywords only when they have the same form and
case; in other words, I cannot choose or influence the collation
algorithm (http://www.unicode.org/reports/tr10/), which corresponds to
the third or fourth level in Xapian (AFAIU).

To give an example: if the word "Côté" has been indexed by Xapian,
"cote" will not match it.
So, I would like Xapian to index both forms ("Côté" and "cote"),
so that both forms match. When a user gives me a string to match
against the index, I will first try to match the string itself, then
transform it (http://www.icu-project.org/icu-bin/translit) and try to
match the transformed version.

So, my question is: Is Boost.Locale capable of transforming Unicode
strings, for instance to remove accents?

A good example is provided by the first answer to the StackOverflow
question:
http://stackoverflow.com/questions/144761/how-to-remove-accents-and-tilde-in-a-c-stdstring
In other words, I would like to apply the "NFD; [:M:] Remove; NFC"
transformation, as is possible with the ICU library. The following
code sample shows how to do that:
____________________________________________________
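// ICU
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
  // Create a transliterator: decompose, remove the marks, recompose
  UErrorCode status = U_ZERO_ERROR;
  const char* lNormaliserID = "NFD; [:M:] Remove; NFC;";
  icu::Transliterator* lNormaliser =
    icu::Transliterator::createInstance (lNormaliserID, UTRANS_FORWARD, status);
  if (U_FAILURE (status) || lNormaliser == NULL) {
    std::cerr << "Cannot create the transliterator: "
              << u_errorName (status) << std::endl;
    return 1;
  }

  // Transliterate the (UTF-8 encoded) test string in place
  icu::UnicodeString myString = icu::UnicodeString::fromUTF8 ("Côté");
  lNormaliser->transliterate (myString);

  // Convert the result back to UTF-8 for output
  std::string lResult;
  myString.toUTF8String (lResult);
  std::cout << "Normalized version without accents: "
            << lResult << std::endl;

  delete lNormaliser;
  return 0;
}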
____________________________________________________

Do not hesitate to send any suggestions, feedback or workarounds.

Kind regards

-denis

