Subject: [boost] [locale] review
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-04-16 18:02:06


Dear All,

Here is a brief review of the proposed Boost.Locale library. I know
little about the internationalisation issues that this library
addresses, so I will focus just on the few areas where I do have some experience.

The documentation is generally of good quality but could be expanded in
places as noted below. I appreciate the introduction of std::locale at
the beginning; having avoided std::locale for precisely the "common
critical problems" mentioned, this introduction was useful. As others
have said, a similar introduction to icu, perhaps even deep links to
appropriate parts of its documentation, and perhaps also to Unicode
docs would be useful. I also found it harder than it should be to find
which header each function is in: e.g. the charset_handling.html page
does not name the header, and clicking through to e.g. the from_utf
reference page doesn't name it either.

Use of strings rather than symbols: there are a few places where the
library uses strings such as "UTF-8" or "de_DE.UTF8" rather than
symbols like utf_8 or de_DE_UTF8. Using strings turns what could be
compile-time errors into run-time errors: mis-typing e.g. "UTF_8" or
"de-DE" is too easy. Perhaps in some cases the domain of the parameter
is not known at compile time, but even then it should be possible to
provide symbols for the most common choices.

Collation: I would be interested to know how fast these comparisons are
compared to regular std::string::operator<. If I correctly understand
what transform does (more docs needed), it would be interesting to then
compare the performance of std::map<...,comparator> with a std::map<>
storing transformed strings. Or similarly, to compare std::sort using
the comparator with std::sort of a pair<transformed string, original string>.

Ideally these collation functions would take character ranges of
arbitrary types, however I guess that the icu API takes char* making
this unreasonable. Apparently there is no option for a NULL-terminated
char*, unlike the case conversion functions, right?

Case conversions: I don't know what "title case" is or what "fold case"
means; the docs should define this. At least linking to the
definitions of the Unicode normalisation forms would also be useful;
preferably, some explanation of how they should be used would be included.

I can think of two recent case-conversion problems that I have had,
both related to geographic names:

1. I had a list of North American names, in "UPPER CASE". I decided to
convert them to "Initial Capitals" for better appearance, but later
realised that this was broken for accented names in Quebec. I could
imagine some sort of "case conversion is lossy" property of a locale
that would detect this; does anything like that exist? (I guess not.)

2. In a similar case I found that " 's " was being wrongly capitalised,
e.g. "St. John'S Street". (This was with PostgreSQL's case conversion
functions.) Again it would be useful to know whether (a) does the
library try to handle this sort of thing correctly, and (b) can it tell
me whether the operation that I want to do is lossy or not? For
example, say I have a set of place names that are a mixture of en_US
and en_CA and fr_CA but I don't know which; can I perform a conversion
that works because those locales all have the same rules for e.g. ' and
-? (I guess not.)

Character set conversion: I think it's great that Boost might finally
get character set conversion - it's overdue! However, I'm not
enthusiastic about this:

   std::string utf8_string = to_utf<char>(latin1_string,"Latin1");

1. It should be possible for the source character set to be a
compile-time constant symbol, and a template parameter e.g.
conv<latin1,utf8>(str). As it is, if I want to convert just a few
bytes, it will surely spend more time decoding the charset name string
than actually converting.

2. As above, ideally it should take a range of arbitrary type and
output to a back_insert_iterator; presumably this is an icu limitation.

3. The error handling seems to be basic - it's "all or nothing".
Considering from_utf, since this is the direction where there are more
things that can be unrepresentable, I might want to:
- Skip the character;
- Replace with '?';
- Replace with an unaccented version, or e.g. curly-quotes to plain-quotes;
- Throw an exception that aborts the whole operation;
- Do the conversion up to the point of difficulty and then indicate the problem.

4. There's also the case where the input is chunked, and the
chunk-breaks might fall within multi-byte characters; so individual
chunks are potentially malformed while the overall stream is OK. I
think the best way to handle this is with a stateful functor to do the
conversion; I wrote one using iconv, like this:

Iconver<latin1,utf8> latin1_to_utf8;

while (!end_of_input()) {
   output.append( latin1_to_utf8(input.get_chunk()) );
}
// detect unconverted partial chars at end of input:
assert(latin1_to_utf8.empty());

Also: how is the performance of this conversion? If it is being done
by icu, we should have a hook (i.e. specialisation, hence my suggestion
of template parameters for the character sets) so that we can provide
optimised versions of important conversions.

Here are my answers to the "normal questions":

    - What is your evaluation of the design?

Generally it seems like a reasonable design for an icu wrapper. If it
were not an icu wrapper, no doubt some things could be done better.

     - What is your evaluation of the implementation?

I've looked at very little of the source. I decided to look at
iconv_codepage.hpp since I have some experience with iconv.

I note that a union is being used in what I believe is an undefined way:

         union {
             char first;
             uint16_t u16;
             uint32_t u32;
         } v;

             v.u16 = 1;
             if(v.first == 1) {
                 return "UTF-16LE";
             }
             else {
                 return "UTF-16BE";
             }

In my own iconv wrapper code I need a const_cast because the POSIX spec
misses a const on the second argument to iconv(); some platforms
correct this, but not all:

#if defined(__FreeBSD__) || defined(__OpenBSD__) \
 || defined(__NetBSD__) || defined(__sun__)
// Previously __APPLE__ was included in this list; presumably they have
// changed their headers. If you have an older system you may need to put
// it back.
#define ICONV_ARG2_IS_CONST
#endif

This doesn't seem to be done here.

There seems to be some use of reinterpret_cast which I think should be
a static_cast via void*:

     virtual std::string convert(char_type const *ubegin,char_type const *uend)
     {
         ....
         char const *begin = reinterpret_cast<char const *>(ubegin);

"int err" should probably be "errno_t err".

There seems to be a "magic constant" of 10 to allow for the difference
in size between the input and the output.

Some comments would help. For example, to describe the meaning of the
return values and error codes from iconv().

Based on that limited look, my view is that the code probably works,
but it could be improved in many ways.

     - What is your evaluation of the documentation?

Good enough.

     - What is your evaluation of the potential usefulness of the
       library?

Useful and overdue.

     - Did you try to use the library? With what compiler? Did you
       have any problems?

No, I've not compiled anything.

     - How much effort did you put into your evaluation? A glance? A
       quick reading? In-depth study?

3 hours reading the docs and looking at a tiny bit of the implementation.

     - Are you knowledgeable about the problem domain?

Not beyond the areas that I have discussed above.

> Please explicitly state in your review whether the library should be
> accepted.

It's borderline. I was going to say "yes, but please consider updating
at least the charset conversion functions in the way that I proposed".
But then I looked at the source and noted what I think are well-known
issues like abuse of union and reinterpret_cast, which have lowered my
opinion a bit. And then I've seen some other reviews that have looked
at the areas I've not considered (i.e. message translation) and they
have been quite critical. Based on all of that, my view has shifted to
"no, but this is worthwhile functionality, so please come back when
you've resolved some of these concerns".

Regards, Phil.

