Subject: [boost] [locale] Normalization and transformation
From: Denis Arnaud (denis.arnaud_boost_at_[hidden])
Date: 2012-09-23 07:47:58
Hi,
The following sample program exercises the text conversion features of
Boost.Locale (Boost version 1.48.0 on Linux Fedora 17), as described in the
documentation (http://unicode.org/reports/tr15/#Norm_Forms and
http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/index.html):
___________________________________________________________
#include <boost/locale.hpp>
#include <iostream>
#include <locale>
#include <string>
int main() {
  // Get the global localisation backend manager
  boost::locale::localization_backend_manager locBEMgr =
    boost::locale::localization_backend_manager::global();
  // Select ICU backend as default
  locBEMgr.select ("icu");
  // Set this backend globally
  boost::locale::localization_backend_manager::global (locBEMgr);
  // Create a generator that uses this backend.
  boost::locale::generator locGen (locBEMgr);
  // Generate the system default locale and install it as the global locale
  std::locale::global (locGen (""));
  // Test string with accents (French word for "side");
  // the source file is assumed to be saved as UTF-8
  std::string sideFR ("Côté");
  // Test the Boost Locale string conversions
  std::cout << "Original: " << sideFR << std::endl
            << "Upper " << boost::locale::to_upper (sideFR) << std::endl
            << "Lower " << boost::locale::to_lower (sideFR) << std::endl
            << "Title " << boost::locale::to_title (sideFR) << std::endl
            << "Fold " << boost::locale::fold_case (sideFR) << std::endl
            << "Normalised - [NFD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfd)
            << "; [NFC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfc)
            << "; [NFKD]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkd)
            << "; [NFKC]: "
            << boost::locale::normalize (sideFR, boost::locale::norm_nfkc)
            << std::endl;
  return 0;
}
___________________________________________________________
The output is:
_____________________________________________________
Original: Côté
Upper CÔTÉ
Lower côté
Title Côté
Fold côté
Normalised - [NFD]: Côté; [NFC]: Côté; [NFKD]: Côté; [NFKC]: Côté
____________________________________________________
Apparently, there is no visible difference between the normalization forms,
whichever one is used (NFC, NFD, NFKC, NFKD). I would expect NFD and NFKD
to produce a different result. But maybe my (UTF-8-based) Linux terminal
automatically recomposes the letters and accents, so that the difference
does not show?
Or is it a feature of Boost.Locale that I overlooked?
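One way to take the terminal out of the equation is to print the bytes
rather than the glyphs. Below is a minimal sketch using only the standard
library; toHex, kCoteNFC and kCoteNFD are names made up for this example,
and the two constants are the UTF-8 encodings I would expect for the
composed and decomposed forms of "Côté".
___________________________________________________________
```cpp
#include <cstdio>
#include <string>

// Render each byte of a string as two lowercase hex digits, space-separated.
std::string toHex (const std::string& bytes) {
  std::string out;
  char buf[4];
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    std::snprintf (buf, sizeof buf, "%02x",
                   static_cast<unsigned char> (bytes[i]));
    if (i != 0) out += ' ';
    out += buf;
  }
  return out;
}

// "Côté" spelled out byte by byte, so that the example does not depend on
// the encoding of the source file.
// Composed form: precomposed ô (U+00F4) and é (U+00E9).
const std::string kCoteNFC = std::string ("C") + "\xC3\xB4" + "t" + "\xC3\xA9";
// Decomposed form: o + U+0302 (circumflex), e + U+0301 (acute).
const std::string kCoteNFD = std::string ("Co") + "\xCC\x82" + "te" + "\xCC\x81";
```
___________________________________________________________
In the program above, passing each result of boost::locale::normalize to
toHex instead of streaming it directly should show 6 bytes for the composed
forms and 8 bytes for the decomposed ones, even if the terminal draws them
identically.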
Let me state my goal.
I use Xapian (http://xapian.org/docs/) as a full-text matching engine,
and I feed it with texts in various languages and scripts, all in UTF-8.
Xapian will typically match only keywords with the same form and case;
in other words, I cannot choose or influence the strength of the collation
algorithm (http://www.unicode.org/reports/tr10/), which corresponds to the
third or fourth level in Xapian (AFAIU).
For example, if the word "Côté" has been indexed by Xapian,
"cote" will not match it.
So, I would like Xapian to index both forms ("Côté" and "cote"),
so that both forms match. When a user gives me a string to match
against the index, I will first try to match the string itself, then
transform it (http://www.icu-project.org/icu-bin/translit) and try to
match the transformed version.
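To make the two-step lookup concrete, here is a rough sketch. It does not
use the Xapian API: TermIndex is a hypothetical stand-in (a plain std::set
of indexed terms), and matchWithFallback and its transform parameter are
names made up for this example.
___________________________________________________________
```cpp
#include <functional>
#include <set>
#include <string>

// Hypothetical stand-in for the full-text index: a set of indexed terms.
using TermIndex = std::set<std::string>;

// Try the query as given first; if it is not indexed, try its transformed
// version. Returns the term that matched, or an empty string if neither did.
std::string matchWithFallback (
    const TermIndex& index,
    const std::string& query,
    const std::function<std::string (const std::string&)>& transform) {
  if (index.count (query) != 0) {
    return query;
  }
  const std::string transformed = transform (query);
  if (index.count (transformed) != 0) {
    return transformed;
  }
  return std::string();
}
```
___________________________________________________________
The transform is passed in as a function so that the accent-removal step
can be swapped without touching the lookup logic.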
So, my question is: Is Boost.Locale capable of transforming Unicode
strings, for instance to remove accents?
A good example is provided by the first answer to the StackOverflow
question:
http://stackoverflow.com/questions/144761/how-to-remove-accents-and-tilde-in-a-c-stdstring
In other words, I would like to apply the "NFD; [:M:] remove; NFC"
transformation, as is possible with the ICU library. The following is a
code sample showing how to do that:
____________________________________________________
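// ICU
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>
int main() {
  // Create a transliterator: decompose, remove all marks, recompose
  UErrorCode status = U_ZERO_ERROR;
  const char* lNormaliserID = "NFD; [:M:] Remove; NFC;";
  std::unique_ptr<icu::Transliterator> lNormaliser (
    icu::Transliterator::createInstance (lNormaliserID, UTRANS_FORWARD, status));
  if (U_FAILURE (status) || !lNormaliser) {
    std::cerr << "Cannot create the transliterator: "
              << u_errorName (status) << std::endl;
    return 1;
  }
  // Registering the transliterator is not needed when it is used directly.
  // The source file is assumed to be saved as UTF-8.
  icu::UnicodeString myString = icu::UnicodeString::fromUTF8 ("Côté");
  lNormaliser->transliterate (myString);
  // Convert the result back to UTF-8 for output
  std::string lResult;
  myString.toUTF8String (lResult);
  std::cout << "Normalized version without accents: "
            << lResult << std::endl;
  return 0;
}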
____________________________________________________
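If it turns out that only normalization is available from Boost.Locale,
a fallback I can think of is to strip the marks by hand after NFD. The
sketch below is an approximation, not an equivalent of "[:M:] Remove": it
only drops code points in the Combining Diacritical Marks block
(U+0300..U+036F), and it assumes its input is valid UTF-8 that is already
in NFD (for instance the result of boost::locale::normalize with norm_nfd).
stripCombiningMarks is a name made up for this example.
___________________________________________________________
```cpp
#include <string>

// Remove code points U+0300..U+036F from a UTF-8 string.
// Assumption: the input is valid UTF-8 and already decomposed (NFD).
std::string stripCombiningMarks (const std::string& nfd) {
  std::string out;
  out.reserve (nfd.size());
  for (std::size_t i = 0; i < nfd.size(); ++i) {
    const unsigned char lead = static_cast<unsigned char> (nfd[i]);
    if ((lead == 0xCC || lead == 0xCD) && i + 1 < nfd.size()) {
      const unsigned char next = static_cast<unsigned char> (nfd[i + 1]);
      // U+0300..U+033F encode as CC 80..CC BF,
      // U+0340..U+036F encode as CD 80..CD AF.
      const bool isMark =
        (lead == 0xCC && next >= 0x80 && next <= 0xBF) ||
        (lead == 0xCD && next >= 0x80 && next <= 0xAF);
      if (isMark) {
        ++i;  // skip the second byte of the mark as well
        continue;
      }
    }
    out += nfd[i];
  }
  return out;
}
```
___________________________________________________________
Marks outside that block (for other scripts) would survive, which is why
the ICU transliterator remains the more complete option.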
Do not hesitate to share any suggestions, feedback, or workarounds.
Kind regards
-denis