Subject: [boost] [locale] Normalization and transformation
From: Denis Arnaud (denis.arnaud_boost_at_[hidden])
Date: 2012-09-23 07:47:58

Next message: Denis Arnaud: "Re: [boost] [locale] Normalization and transformation"
Previous message: Tim Blechmann: "Re: [boost] [graph][heap][coroutine] interruptable dijkstra shortest path function"
Next in thread: Denis Arnaud: "Re: [boost] [locale] Normalization and transformation"
Reply: Denis Arnaud: "Re: [boost] [locale] Normalization and transformation"

Hi,

The following sample program exercises the text conversion features of
Boost.Locale (Boost version 1.48.0 on Fedora 17 Linux), as described in the
documentation (http://unicode.org/reports/tr15/#Norm_Forms and
http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/index.html):
___________________________________________________________
#include <iostream>
#include <locale>
#include <string>
#include <boost/locale.hpp>

int main() {
  // Get the global localisation backend
  boost::locale::localization_backend_manager locBEMgr =
    boost::locale::localization_backend_manager::global();

  // Select ICU backend as default
  locBEMgr.select ("icu");

  // Set this backend globally
  boost::locale::localization_backend_manager::global (locBEMgr);

  // Create a generator that uses this backend
  boost::locale::generator locGen (locBEMgr);

  // Generate the system default locale and install it globally
  std::locale::global (locGen (""));

  // Test string with accents (French word for "side");
  // the source file is assumed to be saved as UTF-8
  std::string sideFR ("Côté");

  // Test the Boost Locale string conversions
  std::cout << "Original: " << sideFR << std::endl
            << "Upper " << boost::locale::to_upper (sideFR) << std::endl
            << "Lower " << boost::locale::to_lower (sideFR) << std::endl
            << "Title " << boost::locale::to_title (sideFR) << std::endl
            << "Fold " << boost::locale::fold_case (sideFR) << std::endl
            << "Normalised - [NFD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfd)
            << "; [NFC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfc)
            << "; [NFKD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkd)
            << "; [NFKC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkc)
            << std::endl;

  return 0;
}
___________________________________________________________

The output is:
_____________________________________________________
Original: Côté
Upper CÔTÉ
Lower côté
Title Côté
Fold côté
Normalised - [NFD]: Côté; [NFC]: Côté; [NFKD]: Côté; [NFKC]: Côté
____________________________________________________

Apparently, there is no visible difference between the normalization forms,
whichever one is used (NFC, NFD, NFKC, NFKD). I would expect NFD and NFKD
to produce a different result. But maybe my (UTF-8-based) Linux terminal
automatically recomposes the letters and accents, so that the difference
does not show?
Or is it a feature of Boost.Locale that I overlooked?
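
One way to rule out the terminal is to look at the bytes rather than the
rendered glyphs. The following is a minimal sketch, using only the standard
library; the helper name toHex is hypothetical and not part of Boost.Locale.
For "Côté" encoded in UTF-8, the composed form (NFC) should dump as
43 C3 B4 74 C3 A9, and the decomposed form (NFD) as 43 6F CC 82 74 65 CC 81.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: return the bytes of a string as space-separated
// uppercase hexadecimal pairs, so that composed and decomposed forms can
// be told apart even when the terminal renders them identically.
std::string toHex (const std::string& iBytes) {
  std::string oHex;
  char lBuffer[4];
  for (unsigned char lByte : iBytes) {
    std::snprintf (lBuffer, sizeof (lBuffer), "%02X", lByte);
    if (!oHex.empty()) {
      oHex += ' ';
    }
    oHex += lBuffer;
  }
  return oHex;
}
```

In the sample program above, each normalize() result could be wrapped, for
instance toHex (boost::locale::normalize (sideFR, boost::locale::norm_nfd)),
to check whether the four forms really produce the same bytes.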

Let me state my goal.

I use Xapian (http://xapian.org/docs/) as a full-text matching engine,
and I feed it with texts in various languages and scripts, all in UTF-8.
Xapian typically matches keywords only when they have the same form and
case; in other words, I cannot choose or influence the strength of the
collation algorithm (http://www.unicode.org/reports/tr10/), which
corresponds to the third or fourth level in Xapian (AFAIU).

For example, if the word "Côté" has been indexed by Xapian,
"cote" will not match it.
So, I would like Xapian to index both forms ("Côté" and "cote"),
so that both forms match. When a user gives me a string to match
against the index, I will first try to match the string itself, then
transform it (http://www.icu-project.org/icu-bin/translit) and try to
match the transformed version.
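
To make the intended flow concrete, here is a rough sketch of it. It does
not use the Xapian API: a std::set stands in for the index, and the
transformation is passed in as a function, since which library provides it
is exactly the open question. All names (TermIndex_T, Transform_T,
indexBothForms, matches) are hypothetical.

```cpp
#include <functional>
#include <set>
#include <string>

// Hypothetical stand-in for the full-text index (not the Xapian API)
typedef std::set<std::string> TermIndex_T;

// Any accent-removing transformation, e.g. one built on ICU
typedef std::function<std::string (const std::string&)> Transform_T;

// Index both the original term and its transformed form
void indexBothForms (TermIndex_T& ioIndex, const std::string& iTerm,
                     const Transform_T& iTransform) {
  ioIndex.insert (iTerm);
  ioIndex.insert (iTransform (iTerm));
}

// Try the query as given first, then fall back to its transformed form
bool matches (const TermIndex_T& iIndex, const std::string& iQuery,
              const Transform_T& iTransform) {
  if (iIndex.count (iQuery) > 0) {
    return true;
  }
  return iIndex.count (iTransform (iQuery)) > 0;
}
```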

So, my question is: Is Boost.Locale capable of transforming Unicode
strings, for instance to remove accents?

A good example is provided by the first answer to the StackOverflow
question:
http://stackoverflow.com/questions/144761/how-to-remove-accents-and-tilde-in-a-c-stdstring
In other words, I would like to apply the "NFD; [:M:] remove; NFC"
transformation, as is possible with the ICU library. The following is a
code sample showing how to do that:
____________________________________________________
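// ICU
#include <iostream>
#include <memory>
#include <string>
#include <unicode/translit.h>
#include <unicode/unistr.h>

int main() {
  // Create a transliterator that decomposes, strips marks and recomposes
  UErrorCode status = U_ZERO_ERROR;
  const char* lNormaliserID = "NFD; [:M:] Remove; NFC;";
  std::unique_ptr<icu::Transliterator> lNormaliser (
    icu::Transliterator::createInstance (lNormaliserID, UTRANS_FORWARD, status));
  if (U_FAILURE (status) || !lNormaliser) {
    std::cerr << "Cannot create the transliterator: "
              << u_errorName (status) << std::endl;
    return 1;
  }

  // Transliterate in place (the literal is assumed to be UTF-8 encoded)
  icu::UnicodeString myString = icu::UnicodeString::fromUTF8 ("Côté");
  lNormaliser->transliterate (myString);

  // Convert back to UTF-8 for display
  std::string lResult;
  myString.toUTF8String (lResult);
  std::cout << "Normalized version without accents: " << lResult << std::endl;

  return 0;
}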
____________________________________________________
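
For illustration only, the middle step ("[:M:] Remove") can be approximated
without ICU, under strong assumptions: the input is valid UTF-8 that has
already been decomposed (NFD), and only the Combining Diacritical Marks
block (U+0300 to U+036F) is removed, which is a small subset of what [:M:]
covers. There is no final NFC recomposition either. The function name
stripCombiningMarks is hypothetical.

```cpp
#include <string>

// Remove the code points U+0300..U+036F from a UTF-8 string.
// In UTF-8 they are the two-byte sequences CC 80..CC BF and CD 80..CD AF.
// Assumption: the input is valid UTF-8 and already in decomposed form (NFD).
std::string stripCombiningMarks (const std::string& iUtf8) {
  std::string oResult;
  for (std::string::size_type i = 0; i < iUtf8.size(); ++i) {
    const unsigned char lLead = static_cast<unsigned char> (iUtf8[i]);
    if (i + 1 < iUtf8.size()) {
      const unsigned char lTrail = static_cast<unsigned char> (iUtf8[i + 1]);
      const bool isMark =
        (lLead == 0xCC && lTrail >= 0x80 && lTrail <= 0xBF)
        || (lLead == 0xCD && lTrail >= 0x80 && lTrail <= 0xAF);
      if (isMark) {
        ++i;  // skip the trail byte of the mark as well
        continue;
      }
    }
    oResult += iUtf8[i];
  }
  return oResult;
}
```

A precomposed "ô" (bytes C3 B4) passes through this function unchanged,
which is why the decomposition has to come first in the transformation.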

Do not hesitate to share any suggestions, feedback or workarounds.

Kind regards

-denis
