Subject: Re: [boost] [locale] review
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-04-17 04:46:47
> From: Phil Endecott <spam_from_boost_dev_at_[hidden]>
>
> As others have said, a similar introduction to icu,
> perhaps even deep links to appropriate parts of its
> documentation, and perhaps also to Unicode docs would be useful.
The point is that Boost.Locale uses ICU as one of its back-ends
(even though it is the primary and the best one), so I actually prefer
to decouple Boost.Locale from ICU to some degree, so users
would see that Boost.Locale is not just a convenience wrapper around
ICU but more than that.
Also, Unicode is a very large topic, and a Google search for
ICU and Unicode would basically give you the best links...
> I found that it was not sufficiently easy to find
> which header each function is in, e.g. on the charset_handling.html
> page the name of the header is not mentioned; if I click on e.g. from_utf
> I get to a reference page that also doesn't name the header.
>
Good point. I'll add it.
I'm a little surprised the header name is missing in the reference
docs, because that is something Doxygen usually adds.
>
> Use of strings rather than symbols:
> there are a few places where the library
> uses strings such as "UTF-8" or "de_DE.UTF8"
> rather than symbols like utf_8 or de_DE_UTF8.
> This use of strings can make things run-time errors
> that could otherwise be compile-time errors.
> Mis-typing e.g. "UTF_8" or "de-DE" is too easy.
> Perhaps in some cases the domain of the parameter
> is not known at compile time, but even in those cases,
> it should be possible to provide symbols for the most
> common choices.
Which common choices? There are a few dozen different
character encodings, there are even more locales, and what is
considered common?
Also, not all encodings are supported by all backends. For example,
iconv would not handle some Windows codepages, and MultiByteToWideChar
would not handle some other encodings.
As for locales: is de_DE.UTF-8 common? Is he_IL.UTF-8 common?
Is zh_CN.UTF-8 common?
Also, the level of support by different backends may depend
on the actual OS configuration: if some locale is not configured
on Windows or Linux, the non-ICU backends would fall back to
the C/POSIX locale.
So should there be hard-coded constants for locales and encodings?
I doubt it.
However, I think there may be a useful conversion like:

   std::string utf8_str = utf_to_utf<char>(L"Hello"); // UTF-16/UTF-32 to UTF-8

This was raised by other reviewers as well, and I actually consider
adding something like that to the code.
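For illustration, a minimal sketch of how such calls might look, assuming
the function lands under boost::locale::conv next to the other conversion
functions (the header name is my guess, not a committed interface):

   #include <boost/locale/encoding_utf.hpp> // assumed header
   #include <string>

   int main()
   {
       using boost::locale::conv::utf_to_utf;
       std::string  utf8 = utf_to_utf<char>(L"Hello"); // UTF-16/32 -> UTF-8
       std::wstring wide = utf_to_utf<wchar_t>(utf8);  // UTF-8 -> UTF-16/32
   }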
Also, a small note about mistyping a character encoding:
encoding names are compared by letters and digits only; the rest of the
symbols are ignored. So from the Boost.Locale and ICU point of view,
UTF-8, UTF8, Utf_8 and UTF#$%^#$8 are all the same.
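A minimal sketch of that kind of comparison (my own illustration, not the
actual Boost.Locale code):

   #include <cctype>
   #include <string>

   // Compare encoding names by letters and digits only, case-insensitively;
   // every other symbol is skipped.
   bool same_encoding(std::string const &a, std::string const &b)
   {
       std::string::size_type i = 0, j = 0;
       for (;;) {
           while (i < a.size() && !std::isalnum((unsigned char)a[i])) ++i;
           while (j < b.size() && !std::isalnum((unsigned char)b[j])) ++j;
           if (i == a.size() || j == b.size())
               return i == a.size() && j == b.size();
           if (std::tolower((unsigned char)a[i++]) !=
               std::tolower((unsigned char)b[j++]))
               return false;
       }
   }
   // same_encoding("UTF-8", "utf_8") == true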
>
>
> Collation: I would be interested to know how fast these
> comparisons are compared to regular std::string::operator<.
> If I correctly understand what transform does (more docs needed),
> it would be interesting to then compare the performance of
> std::map<...,comparator> with a std::map<> storing transformed strings.
> Or similarly, to compare std::sort using the comparator with std::sort
> of a pair<transformed string, original string>.
Transform is something very common for collation purposes; every
collation API provides it (ICU, POSIX, C and the Win32 API).
I haven't done any specific benchmarks, but plain string comparison is
much faster, as collation normally involves Unicode normalization and
then lookups in a locale-specific character database.
I doubt that std::sort of transformed strings would be much faster,
as the transform costs as well.
But if you frequently search for some string in a database, it has a
significant performance impact: you transform each string only once and
from then on just run strcmp, which is much faster.
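For example, with the standard collate facet (which, as far as I know,
Boost.Locale installs its own collator as) the transformed key is built
once and then compares with plain operator< / strcmp:

   #include <locale>
   #include <string>

   // Build the collation sort key once; the keys then compare correctly
   // with ordinary byte-wise comparison.
   std::string sort_key(std::locale const &loc, std::string const &s)
   {
       std::collate<char> const &col =
           std::use_facet< std::collate<char> >(loc);
       return col.transform(s.data(), s.data() + s.size());
   }

   // Usage sketch: store std::pair<sort_key(loc, name), name> and
   // sort/search by the first member instead of calling compare().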
>
> Ideally these collation functions would take character ranges
> of arbitrary types, however I guess that the icu API takes
> char* making this unreasonable. Apparently there is no option
> for a NULL-terminated char*, unlike the case conversion functions, right?
>
Good point. I'll take a look at this.
>
> Case conversions: I don't know what
> "title case" is or what "fold case" means;
> the docs should define this.
> At least linking to the definitions of the Unicode
> normalisation forms would also be useful;
> preferably, some explanation of how they
> should be used would be included.
Noted. Good point.
>
> I can think of two recent case-conversion problems that I have had, both
>related to geographic names:
>
> 1. I had a list of North American names, in "UPPER CASE".
> I decided to convert them to "Initial Capitals" for better appearance,
> but later realised that this was broken for accented names in Quebec.
> I could imagine some sort of "case conversion is lossy" property
> of a locale that would detect this; does anything like that exist? (I
>guess not.)
>
Actually, any case conversion is "lossy" because it does not know the
original case; however, the good news is that most code points in the
Unicode database have no case at all.
Roughly speaking, only European scripts actually have case; Semitic,
East-Asian and many other scripts don't have case at all.
And BTW, "Initial Capitals" is actually Title Case.
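For example (a sketch using the Boost.Locale case conversion functions;
the exact output assumes the ICU backend):

   #include <boost/locale.hpp>
   #include <iostream>

   int main()
   {
       boost::locale::generator gen;
       std::locale loc = gen("en_US.UTF-8");
       // Title Case == "Initial Capitals"; accents survive the conversion
       std::cout << boost::locale::to_title("SAINT-RÉMY", loc) << "\n";
       // -> "Saint-Rémy"
   }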
> 2. In a similar case I found that " 's " was being
> wrongly capitalised, e.g. "St. John'S Street".
> (This was with PostgreSQL's case conversion functions.)
> Again it would be useful to know whether
> (a) does the library try to handle this sort of thing correctly, and
> (b) can it tell me whether the operation that I want to do is lossy or not?
> For example, say I have a set of place names that are a mixture of en_US and
>en_CA
> and fr_CA but I don't know which; can I perform a conversion that
> works because those locales all have the same rules for e.g. ' and -? (I
>guess not.)
Case handling is actually a Unicode algorithm, and it behaves the same
for almost all locales but a few: Azeri and Turkish, where "i" is
converted to "I with a dot" and "I" is converted to "i without a dot",
and AFAIR there are one or two other locale-dependent corner cases.
In any case, case handling is always lossy, though it depends on how
you define it.
To answer (a): yes, it handles this sort of thing right using the ICU
backend.
If you ask whether it does everything right, then no, because
different languages may have different capitalization rules,
but as far as I can see the Unicode algorithm and the current ICU
library ignore them.
But it is still the best we can get.
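To illustrate the Turkish corner case (a sketch; the dotted "İ" requires
the ICU backend and a configured tr_TR locale):

   #include <boost/locale.hpp>
   #include <iostream>

   int main()
   {
       boost::locale::generator gen;
       std::cout << boost::locale::to_upper("istanbul", gen("tr_TR.UTF-8"));
       // -> "İSTANBUL" (U+0130), while en_US gives plain "ISTANBUL"
   }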
>
>
> Character set conversion:
> I think it's great that Boost might finally get character set
> conversion - it's overdue! However, I'm not enthusiastic about this:
>
> std::string utf8_string = to_utf<char>(latin1_string,"Latin1");
>
> 1. It should be possible for the source character set to be a
> compile-time constant symbol, and a template parameter e.g.
> conv<latin1,utf8>(str). As it is, if I want to convert just a
> few bytes, it will surely spend more time decoding the charset name
> string than actually converting.
See the notes above about constants. Also, Boost.Locale does
not use its own conversion tables; it relies on third-party APIs
like ICU, iconv or the Win32 API.
So one way or another you will spend some time calling
ucnv_open, iconv_open or something else.
If you want to do stream conversions or convert a few
characters at a time, you can use the codecvt API.
There is actually an example in the docs of using codecvt.
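Something along these lines (a sketch in the spirit of that example; the
generated locale carries a codecvt facet that converts on the fly):

   #include <boost/locale.hpp>
   #include <fstream>

   int main()
   {
       boost::locale::generator gen;
       std::wofstream out("hello.txt");
       out.imbue(gen("en_US.UTF-8")); // facet converts wchar_t -> UTF-8
       out << L"Hello";               // stored on disk as UTF-8
   }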
>
> 2. As above, ideally it should take a range of arbitrary
> type and output to a back_insert_iterator; presumably
> this is an icu limitation.
Actually, an iterator is a bad idiom for charset conversion; it is
more of an operation on a stream, and that is what
codecvt is usable for.
And note that not all APIs support stream operations; the most
notable are the Win32 MultiByteToWideChar API and iconv. I had a very
hard time supporting stream conversion, as the APIs weren't
too friendly in reality. Codepage conversion is generally
much more stateful than most of us think.
For example, iconv may compose several characters into a single
code point.
>
> 3. The error handling seems to be basic - it's "all or nothing".
> Considering from_utf, since this is the direction where there
> are more things that can be unrepresentable, I might want to:
> - Skip the character;
> - Replace with '?';
> - Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
> - Throw an exception that aborts the whole operation;
> - Do the conversion up to the point of difficulty and then indicate the
>problem.
>
A few points:
- Skip and Throw are supported.
> - Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
I'm not aware of any charset API that actually supports
such things.
> - Replace with '?';
As I think I noted before, this is problematic with the non-ICU
backends, as not all of them support such operations; more
than that, some do not even provide a way to know
where the operation failed (the Win32 API).
> - Do the conversion up to the point of difficulty and then indicate the
>problem.
This is a problem as well, as not all APIs provide this.
> 4. There's also the case where the input is chunked,
> and the chunk-breaks might fall within multi-byte characters;
> so individual chunks are potentially malformed while
> the overall stream is OK. I think the best way to handle
> this is with a stateful functor to do the comparison; I wrote one using
> iconv, like this:
>
> Iconver<latin1,utf8> latin1_to_utf8;
>
> while (!end_of_input()) {
> output.append( latin1_to_utf8(input.get_chunk()) );
> }
> assert(latin1_to_utf8.empty()); // detect unconverted partial chars at end
>of input
>
There is codecvt for this, as I mentioned above; also,
codepage conversion is not the primary goal of Boost.Locale
but rather a utility task.
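A sketch of what the chunked decoding looks like with codecvt (my own
illustration using the standalone std::codecvt_utf8 facet from C++11;
with Boost.Locale you would take the facet from a generated locale):

   #include <codecvt>
   #include <string>

   // Keeps the conversion state and any unconsumed tail bytes across
   // chunk boundaries, so chunks may split multi-byte characters.
   struct utf8_chunk_decoder {
       std::mbstate_t state;
       std::string pending;            // leftover bytes from last chunk
       std::codecvt_utf8<wchar_t> cvt;

       utf8_chunk_decoder() : state(std::mbstate_t()) {}

       std::wstring operator()(std::string const &chunk)
       {
           pending += chunk;
           std::wstring out(pending.size(), L'\0');
           char const *from_next = 0;
           wchar_t *to_next = 0;
           cvt.in(state,
                  pending.data(), pending.data() + pending.size(), from_next,
                  &out[0], &out[0] + out.size(), to_next);
           pending.erase(0, from_next - pending.data());
           out.resize(to_next - &out[0]);
           return out;
       }
       // true when no partial character is left at end of input
       bool empty() const { return pending.empty(); }
   };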
> Also: how is the performance of this conversion?
> If it is being done by icu, we should have a hook
> (i.e. specialisation, hence my suggestion of template
> parameters for the character sets) so that we can provide
> optimised versions of important conversions.
>
I have found that iconv and ICU perform similarly in terms
of performance.
The performance of codecvt for UTF-8 and single-byte character
sets is very good; for multi-byte encodings like Shift-JIS it is not
so good, but that is rather a limitation of the std::codecvt design.
> I note that a union is being used in what I believe is an undefined way:
>
> union {
> char first;
> uint16_t u16;
> uint32_t u32;
> } v;
>
> v.u16 = 1;
> if(v.first == 1) {
> return "UTF-16LE";
> }
> else {
> return "UTF-16BE";
> }
No problem there (see Steven's response).
>
> In my own iconv wrapper code I need a const_cast because
> the POSIX spec misses a const on the second argument to iconv();
> some platforms correct this, but not all:
Actually, POSIX specifies restrict, which is a little
stronger than const.
>
> #if defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) ||
>defined(__sun__)
> // Previously __APPLE__ was included in this list; presumably they have
> // changed their headers. If you have an older system you may need to put
> // it back.
> #define ICONV_ARG2_IS_CONST
> #endif
>
It is wrong, as it may depend on the version of iconv...
Boost.Locale does this in a smarter way; see Steven's response.
> This doesn't seem to be done here.
>
> There seems to be some use of reinterpret_cast which I think should be a
>static_cast via void*:
>
> virtual std::string convert(char_type const *ubegin, char_type const *uend)
> {
> ....
> char const *begin = reinterpret_cast<char const *>(ubegin);
>
No problem there as well; see Steven's response.
> "int err" should probably be "errno_t err".
>
No, see: http://pubs.opengroup.org/onlinepubs/009695399/functions/errno.html
errno is defined as int by the POSIX specifications.
> There seems to be a "magic constant" of 10 to allow for the difference in
> size between the input and the output.
>
So? It is just a guess for how much to extend the buffer; it has no
other meaning.
> Some comments would help. For example, to describe the meaning
> of the return values and error codes from iconv().
>
man iconv?
Also, EINVAL (invalid value), E2BIG (too big) and EILSEQ (illegal
sequence) are quite self-explanatory, at least for those familiar with
Unix programming.
>
> - What is your evaluation of the documentation?
>
> Good enough.
>
> - What is your evaluation of the potential usefulness of the
> library?
>
> Useful and overdue.
>
> - Did you try to use the library? With what compiler? Did you
> have any problems?
>
> No, I've not compiled anything.
>
> - How much effort did you put into your evaluation? A glance? A
> quick reading? In-depth study?
>
> 3 hours reading the docs and looking at a tiny bit of the implementation.
>
> - Are you knowledgeable about the problem domain?
>
> Not beyond the areas that I have discussed above.
>
>
> > Please explicitly state in your review whether the library should be
> > accepted.
>
> It's borderline. I was going to say "yes, but please
> consider updating at least the charset conversion functions
> in the way that I proposed".
Some things, like the utf-to-utf conversion, will be added; however,
I have doubts about the substitution for the reasons I explained above.
> But then I looked at the source and noted what I think are well-known
> issues like abuse of union and reinterpret_cast, which have lowered my
> opinion a bit.
There was nothing wrong there, and Steven has answered most of the
points.
>
>
> Regards, Phil.
>
>
Thanks for the review.
Artyom