From: Andrey Semashev (andysem_at_[hidden])
Date: 2007-06-21 16:56:44


Mathias Gaunard wrote:
> Andrey Semashev wrote:
>
>> I'd rather stick to UTF-16 if I had to use
>> Unicode.
>
> UTF-16 is a variable-length encoding too.
>
> But anyway, Unicode itself is a variable-length format, even with the
> UTF-32 encoding, simply because of grapheme clusters.

Technically, yes. But most of the widely used character sets fit into
the Basic Multilingual Plane, where every character is a single UTF-16
code unit. That means that I, having said that my app is localized to
languages A, B and C, may treat UTF-16 as a fixed-length encoding if
these languages fit in it. If they don't, I'd consider moving to UTF-32.
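
A minimal sketch of how that assumption could be checked, assuming a
C++11 compiler with std::u16string. The function name is hypothetical,
and the check says nothing about combining sequences or grapheme
clusters; it only detects surrogate pairs:

```cpp
#include <string>

// Hypothetical helper: returns true if the string contains no surrogate
// code units (0xD800..0xDFFF), i.e. every code unit is a complete code
// point from the Basic Multilingual Plane.
bool is_fixed_width_utf16(const std::u16string& s)
{
    for (char16_t c : s)
    {
        if (c >= 0xD800 && c <= 0xDFFF)
            return false; // part of a surrogate pair: variable length
    }
    return true;
}
```

If this returns true for all the text the app handles, indexing by code
unit and indexing by code point give the same result.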

>> I'm not saying that we don't need Unicode support. We do!
>> I'm only saying that in many cases plain ASCII does its job perfectly
>> well: logging, system messages, simple text formatting, texts in
>> restricted character sets, like numbers, phone numbers, identifiers of
>> all kinds, etc.
>
> Identifiers of all kinds aren't text, they're just bytes.

Not always. I may get such an identifier from a text-based protocol
primitive, so I can handle it as text. This assumption may open up
more opportunities for various optimizations.

> As for logging, I'm not too sure whether it should be localized or not.

I can think of only a single case where logging should support i18n.
It's when you have to log external data, such as client app queries or
DB responses. This need is questionable in the first place, because it
may introduce serious security holes. As for regular logging, I feel
quite fine with narrow logs and don't see why I would want to make them
wide.

> And I don't understand what you mean by system messages.

Error and warning descriptions that may come either from your
application or from the OS, some third-party API or the language runtime.
I may agree that these messages could be localized too, but to my mind
that's overkill. Generally, I don't need std::bad_alloc::what()
returning a Russian or Chinese description.

> I still don't understand why you want to work with other character sets.

Because I have the impression that it may be done more efficiently and
at lower cost. I don't want to pay for what I don't need - IMHO,
the fundamental principle of C++.

> That will just require duplicating the tables and algorithms required to
> process the text correctly.

What algorithms do you mean and why would they need duplication?

> See http://www.unicode.org/reports/tr10/ for an idea of the complexity
> of collations, which allow comparison of strings.
> As you can see, it has little to do with encoding, yet the tables etc.
> require the usage of the Unicode character set, preferably in a
> canonical form so that it can be quite efficient.

Collation is just one approach to string comparison and ordering. I
don't see how it is related to the efficiency questions I mentioned.
Besides, comparison is not the only operation on strings. I expect the
complexity of iterating over a string or of operator[] to rise
significantly once we assume that the underlying string has
variable-length chars.
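
To illustrate the cost, here is a rough sketch of what indexing by
character looks like over a variable-length encoding. It assumes
well-formed UTF-8 stored in a std::string, and the function name is
hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// UTF-8 string, or std::string::npos if the string has fewer code points.
// Assumes well-formed UTF-8. The scan is linear in the string length,
// whereas operator[] on a fixed-width string is constant time.
std::size_t utf8_offset_of(const std::string& s, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // Bytes of the form 10xxxxxx continue a code point; any other
        // byte starts a new one.
        if ((b & 0xC0) != 0x80)
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

With a fixed-width encoding the same lookup is just s[n].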

>> There are cases where i18n is not needed at all - mostly
>> server-side apps with minimal UI.
>
> Any application that process or display non-trivial text (meaning
> something else than options) should have internationalization.

I have to disagree. I18n is good when it's needed, i.e. when there are
users that will appreciate it or when it's required by the application
domain and functionality. Otherwise, IMO, it's a waste of effort at the
development stage and of system resources at run time.

> What encoding translation are you talking about?

Let's assume my app works with a narrow text file stream. If the stream
is using Unicode internally, it has to translate between the file
encoding and its internal encoding every time I output or input something.
I don't think that's the way it should be. I'd rather have the
opportunity to choose the encoding I want to work with and use it
throughout the whole formatting/streaming/IO tool chain with no extra
overhead. That doesn't mean, though, that I wouldn't want some day to
perform encoding translations with the same tools.
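
As a sketch of the kind of translation step I mean: assume, purely for
illustration, that the file is ISO-8859-1 and the stream keeps UTF-16
internally. The function name is hypothetical and C++11 is assumed:

```cpp
#include <string>

// Hypothetical conversion a Unicode-internal stream would have to run on
// every read, assuming the file encoding is ISO-8859-1. Each byte maps
// to the code point with the same value, so the conversion is trivial,
// but it still costs an extra pass over the data and a second buffer.
std::u16string latin1_to_utf16(const std::string& bytes)
{
    std::u16string out;
    out.reserve(bytes.size());
    for (char ch : bytes)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    return out;
}
```

An app that keeps the narrow encoding end to end skips this pass and the
matching one on output.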

PS: I have a slight feeling that we have a misunderstanding at this point...

