Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Jeremy Maitin-Shepard (jeremy_at_[hidden])
Date: 2011-04-25 16:31:04


The most significant complaint seems to be the fact that the translation
interface is limited to ASCII (or perhaps UTF-8 is also supported; it
isn't entirely clear).

Even though various arguments have been made for using only ASCII text
literals in the program, it seems that it would be relatively easy to
support other languages. As has been mentioned by someone else, even if
the text really is in English, ASCII may not be sufficient as it may be
desirable to include some special symbol (the copyright symbol, for
instance), and having to deal with this by creating a translation from
"ASCII English to appease translation system" to "Real English to
display to users" would seem to be an unjustifiable additional burden.
However, I don't think anyone is as familiar with the limitations of
gettext-related tools as Artyom, so he is the best person to discuss
exactly how this might be supported. Previously he briefly described a
makeshift approach that required the use of a macro, which didn't seem
like a legitimate solution.

It seems that xgettext (at least the version 0.18.1 that I tested on my
machine) supports non-ASCII program source provided that the --from-code
option is given, so it seems that the user could keep the source code in
any arbitrary character set/encoding and it would still work (and simply
convert the strings to UTF-8). It also appears to successfully extract
strings that are specified with an L prefix, so it seems that should not
be a problem either.
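
For illustration, here is roughly the kind of thing that appears to work
(the file name, encoding, and keyword are just illustrative; translate()
here is the library's message lookup call):

    // example.cpp, saved in e.g. ISO-8859-1 rather than UTF-8.
    // Extracted with:
    //   xgettext --from-code=ISO-8859-1 --keyword=translate example.cpp -o messages.po
    #include <boost/locale.hpp>
    #include <iostream>

    int main() {
        // The copyright sign is a non-ASCII byte in the file's own encoding;
        // --from-code tells xgettext how to read it, and the extracted .po
        // entry comes out as UTF-8. With a message-enabled locale imbued on
        // the stream the translated text is written; otherwise the original.
        std::cout << boost::locale::translate("© Example") << std::endl;
        return 0;
    }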

I suppose there is some question as to how well existing tools for
translating the messages deal with non-ASCII, but as the tools can be
improved fairly easily if necessary, I don't think this is a significant
concern.

We can assume that the compiler knows the correct character set of the
source code file, as trying to fool it would seem to be inherently error
prone. This seems to rule out the possibility of char * literals
containing UTF-8 encoded text on MSVC, until C++1x Unicode literals are
supported.
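
For reference, the C++1x Unicode literals in question look like this; the
u8 form yields UTF-8 regardless of the compiler's narrow execution
character set:

    // C++1x Unicode string literals (\u00e9 is e with an acute accent).
    const char*     s8  = u8"caf\u00e9";  // UTF-8, char-based
    const char16_t* s16 = u"caf\u00e9";   // UTF-16
    const char32_t* s32 = U"caf\u00e9";   // UTF-32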

The biggest nuisance is that we need to know the compile-time character
set/encoding (so that we know how to interpret
"narrow" string literals), and there does not appear to be any standard
way in which this is recorded (maybe I'm mistaken though). However, it
is easy enough for the user to simply specify this as a preprocessor
define (the build system could add it to the compile flags, and it needs
to be known anyway in order to invoke xgettext --- presumably it would
just be based on the active locale at the time the compiler is invoked).
If none is specified, it could default to UTF-8 (this can also be used
for greater efficiency in the case that the compile-time encoding is not
UTF-8 but the source code happens to only contain ASCII messages).
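
A rough sketch of what I have in mind, with a purely illustrative macro
name:

    // Supplied by the build system to match whatever was passed to
    // xgettext --from-code, e.g. -DTRANSLATION_SOURCE_CHARSET='"ISO-8859-1"'.
    // Defaults to UTF-8 when nothing is specified.
    #ifndef TRANSLATION_SOURCE_CHARSET
    #  define TRANSLATION_SOURCE_CHARSET "UTF-8"
    #endif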

By knowing the compile-time character set, all ambiguity is removed.
The translation database can be assumed to be keyed based on UTF-8, so
to translate a message, it needs to be converted to UTF-8. There should
presumably be versions of the translation functions that take narrow
strings, wide strings, and additional versions for the C++1x unicode
literals once they are supported by compilers (I expect that to be very
soon, at least for some compilers). If a wide string is specified, it
will be assumed to be in UTF-16 or UTF-32 depending on sizeof(wchar_t),
and converted to UTF-8. UTF-32 is generally undesirable, I imagine, but
in practice should nonetheless work and using wide strings might be the
best approach for code that needs to compile on both Windows and Linux.
For the narrow version, if the compile-time narrow encoding is UTF-8,
the conversion is a no-op. Otherwise, the conversion will have to be
done. (The C++1x u8 literal version would naturally also require no
conversion.)
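
In code, the overload set I am describing might look roughly like the
sketch below. The lookup_utf8() helper is made up to stand in for the
actual catalog lookup, TRANSLATION_SOURCE_CHARSET is the proposed define
from above, and the conversions use what I believe are the library's own
charset functions (to_utf / utf_to_utf):

    #include <boost/locale/encoding.hpp>
    #include <string>

    // Hypothetical: queries the UTF-8 keyed translation catalog.
    std::string lookup_utf8(std::string const& utf8_key);

    // Narrow literals, interpreted in the compile-time charset; when that
    // charset is already UTF-8 the conversion could be skipped entirely.
    std::string translate(char const* msg)
    {
        return lookup_utf8(
            boost::locale::conv::to_utf<char>(msg, TRANSLATION_SOURCE_CHARSET));
    }

    // Wide literals, taken as UTF-16 or UTF-32 depending on sizeof(wchar_t),
    // and converted to UTF-8 before the lookup.
    std::string translate(wchar_t const* msg)
    {
        return lookup_utf8(boost::locale::conv::utf_to_utf<char>(msg));
    }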

Note that in the common case of UTF-8 narrow literals, which is the only
case currently supported, there would be no performance penalty.

The documentation could explicitly warn that there is a performance
penalty for not using UTF-8, but I think this penalty is likely to be
acceptable in many cases.

If normalization proves to be an issue, then the conversion to UTF-8
could include normalization (perhaps another preprocessor definition)
and the output of xgettext could also be normalized.

I imagine that, relative to the work required for the whole library, these
changes would be quite trivial, and might very well transform the
library from completely unacceptable to acceptable for a number of
objectors on the list, while having essentially no impact on those who
are happy to use the library as is.

