From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-06-21 17:57:33


Andrey Semashev <andysem_at_[hidden]> writes:

> Mathias Gaunard wrote:
>> Andrey Semashev wrote:
>>
>>> I'd rather stick to UTF-16 if I had to use
>>> Unicode.
>>
>> UTF-16 is a variable-length encoding too.
>>
>> But anyway, Unicode itself is a variable-length format, even with the
>> UTF-32 encoding, simply because of grapheme clusters.

> Technically, yes. But most of the widely used character sets fit into
> UTF-16. That means that I, having said that my app is localized to
> languages A B and C, may treat UTF-16 as a fixed-length encoding if
> these languages fit in it. If they don't, I'd consider moving to
> UTF-32.

Note that even if you can represent a single Unicode code point in your
underlying type for storing a single unit of encoded text, you still
have the issue of combining characters and such. Thus, it is not clear
how a fixed-width encoding makes text processing significantly easier;
I'd be interested to see examples where it does.
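To illustrate (this is just my own sketch, using modern C++ types for
brevity), even when each stored unit is a whole code point, one
user-perceived character need not be one unit:

    #include <iostream>
    #include <string>

    int main() {
        // "e with acute" as a single precomposed code point: U+00E9
        std::u32string precomposed = U"\u00E9";
        // The same character as a base letter plus a combining accent:
        // U+0065 followed by U+0301
        std::u32string decomposed = U"e\u0301";

        // Both display as one grapheme cluster, yet size() and operator[]
        // see a different number of units in each.
        std::cout << precomposed.size() << '\n';  // 1
        std::cout << decomposed.size() << '\n';   // 2
    }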

[snip]

>> As for logging, I'm not too sure whether it should be localized or not.

> I can think of only a single case where logging should be i18n'd. It's when
> you have to log external data, such as client app queries or DB
> responses. This need is questionable in the first place, because it may
> introduce serious security holes. As for regular logging, I feel quite
> fine with narrow logs and don't see why I would want to make them
> wide.

I don't think the narrow/wide terminology is very helpful for discussing
the issues here. I think of internationalization as supporting multiple
languages/locales at run-time (or conceivably just at compile-time),
which would mean supporting multiple languages and formatting
conventions for the log messages. Whether this is useful obviously
depends on the intended use for the logs. The issue of external text
data in log messages is really just a particular instance of outputting
some representation of program data to the log message, and need not
really be considered here. (It may, for instance, involve some sort of
encoding conversion, or using some sort of escape syntax.) Note that
even if log messages are only in a single language, there is still the
issue of how the text of the messages is to be represented (i.e. what
encoding to use).
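For instance, a simple escaping scheme along the following lines (the
function is purely illustrative, not from any particular library) keeps
the log itself in a single predictable encoding no matter what bytes the
external data contains:

    #include <cstdio>
    #include <string>

    // Hex-escape anything outside printable ASCII before embedding
    // external data in a log message.
    std::string escape_for_log(const std::string& external) {
        std::string out;
        for (std::string::size_type i = 0; i < external.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(external[i]);
            if (c >= 0x20 && c < 0x7F && c != '\\') {
                out += static_cast<char>(c);   // printable ASCII passes through
            } else {
                char buf[5];
                std::snprintf(buf, sizeof buf, "\\x%02X", c);
                out += buf;                    // everything else is escaped
            }
        }
        return out;
    }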

[snip]

>> I still don't understand why you want to work with other character sets.

> Because I have an impression that it may be done more efficiently and
> with less expense. I don't want to pay for what I don't need - IMHO,
> that's the guiding principle of C++.

It is important to support this principle. It is useful to consider
exactly what the costs and benefits are of standardizing on a single
encoding (likely UTF-16) for high-level text processing. In some cases,
trying to use templates so that some people avoid paying for what they
don't need results in everyone paying: in compile time, possibly in
developer time, and sometimes in run time, because compilers are not
perfect.

>> That will just require duplicating the tables and algorithms required to
>> process the text correctly.

> What algorithms do you mean and why would they need duplication?

Examples of such algorithms are string collation, comparison, line
breaking, word wrapping, and hyphenation.
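As a small, concrete example of the first of these, the standard
library's std::collate facet already compares strings according to a
locale's rules rather than by byte values (the locale name below is just
an example; whether it is available depends on the platform):

    #include <locale>
    #include <string>

    // Returns <0, 0, or >0, like strcmp, but using the locale's
    // collation rules instead of raw byte comparison.
    int compare_localized(const std::string& a, const std::string& b) {
        std::locale loc("en_US.UTF-8");
        const std::collate<char>& coll =
            std::use_facet<std::collate<char> >(loc);
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size());
    }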

>> See http://www.unicode.org/reports/tr10/ for an idea of the complexity
>> of collations, which allow comparison of strings.
>> As you can see, it has little to do with encoding, yet the tables etc.
>> require the usage of the Unicode character set, preferably in a
>> canonical form so that it can be quite efficient.

> Collation is just one approach to performing string comparison and
> ordering. I don't see how it relates to the efficiency questions I mentioned.
> Besides, comparison is not the only operation on strings. I expect the
> complexity of iterating over a string, or of operator[], to rise significantly
> once we assume that the underlying string has variable-length chars.

The complexity remains the same if operator[] indexes over encoded
units, or you are iterating over the encoded units. Clearly, if you
want an iterator that converts from the existing encoding, which might
be UTF-8 or UTF-16, to UTF-32, then there will be greater complexity.
As stated previously, however, it is not clear why this is likely to be
a frequently useful operation.
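To make the difference concrete, here is a rough sketch (my own; error
handling and validation are omitted) of the per-step work such a
decoding iterator has to do for UTF-8, compared with simply indexing the
encoded units:

    #include <string>

    // Decode one UTF-8 sequence starting at 'it' and advance the
    // iterator past it. Assumes well-formed input.
    char32_t decode_and_advance(std::string::const_iterator& it) {
        unsigned char lead = static_cast<unsigned char>(*it++);
        if (lead < 0x80)
            return lead;                              // 1-byte (ASCII) sequence
        int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);         // keep the payload bits
        while (extra-- > 0)                           // consume continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
        return cp;
    }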

>>> There are cases where i18n is not needed at all - mostly
>>> server-side apps with minimal UI.
>>
>> Any application that processes or displays non-trivial text (meaning
>> something other than options) should have internationalization.

> I have to disagree. I18n is good when it's needed, i.e. when there are
> users who will appreciate it or when it's required by the application
> domain and functionality. Otherwise, IMO, it's a waste of effort at the
> development stage and of system resources at the execution stage.

It is true that forcing certain text operations to be done in UTF-16 (as
opposed to a fixed-width 1-byte encoding) would slow down certain text
processing. In some cases, using UTF-8 would not hurt performance. It
is not clear, though, how often that cost would be significant in
practice.

>> What encoding translation are you talking about?

> Let's assume my app works with a narrow text file stream.

For simplicity, we can avoid using the "narrow"/"wide" terminology and
say you have a text file encoded using a 1-byte fixed-width encoding,
such as ASCII or ISO-8859-1.

> If the stream is using Unicode internally, it has to translate between
> the file encoding and its internal encoding every time I output or
> input something. I don't think that's the way it should be. I'd
> rather have an opportunity to choose the encoding I want to work with
> and have it through the whole formatting/streaming/IO tool chain with
> no extra overhead. That doesn't mean, though, that I wouldn't want
> some day to perform encoding translations with the same tools.

I can see the use for this. I think it may well be important for the
new I/O framework to support streams of arbitrary types and encodings.
There is an issue though in attempting to have the text formatting
system support arbitrary encodings:

I think we all agree that it needs to support a large number of locales,
and by locale I mean a set of formatting conventions for formatting
e.g. numbers and dates, and also some other things that you may not care
as much about, like how to order strings. If multiple encodings are
also supported, then either encoding conversions would have to be done,
which is what you want to avoid, or the formatting and other information
needs to be duplicated for each supported encoding for each locale,
which would mean the amount of data that needs to be stored is doubled
or tripled. Since all but the needed data can likely remain on disk,
however, this may be reasonable.

One strategy would be to require that the data for all locales be stored
in at least one of the Unicode encodings (or perhaps UTF-16
specifically), and then implementations could additionally provide the
data for some locales in other encodings (this data could likely be
generated automatically from the UTF-16 data); some data, such as
collation tables, would likely exist only for the Unicode encodings, so
collation might only be available for Unicode-encoded text. Then at run
time, to format text for a particular locale and encoding, if data for
that locale already exists in the desired encoding, it can be used
without conversion; otherwise, the Unicode data is converted to the
desired encoding.
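A rough sketch of that lookup (all the names here are hypothetical and
only meant to illustrate the fall-back-and-cache idea, not a real
design):

    #include <map>
    #include <string>
    #include <utility>

    struct LocaleData { /* formatting patterns, collation tables, ... */ };

    // Hypothetical store keyed by (locale name, encoding name).
    typedef std::map<std::pair<std::string, std::string>, LocaleData> Store;
    Store locale_store;

    // Hypothetical conversion from the always-present UTF-16 data;
    // a real implementation would transcode the stored strings.
    LocaleData convert_from_utf16(const LocaleData& utf16_data,
                                  const std::string& /*target_encoding*/) {
        return utf16_data;
    }

    const LocaleData& get_locale_data(const std::string& locale,
                                      const std::string& encoding) {
        Store::iterator it = locale_store.find(std::make_pair(locale, encoding));
        if (it != locale_store.end())
            return it->second;                        // native data, no conversion
        const LocaleData& utf16 =
            locale_store[std::make_pair(locale, std::string("UTF-16"))];
        it = locale_store.insert(std::make_pair(std::make_pair(locale, encoding),
                                 convert_from_utf16(utf16, encoding))).first;
        return it->second;                            // converted once, then cached
    }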

-- 
Jeremy Maitin-Shepard
