From: Andrey Semashev (andysem_at_[hidden])
Date: 2007-06-23 08:26:50


I've already replied to Jeremy Maitin-Shepard on some of his
arguments, which are similar to yours, so I might be repeating myself.

Mathias Gaunard wrote:
> Andrey Semashev wrote:
>
>> Technically, yes. But most of the widely used character sets fit into
>> UTF-16. That means that I, having said that my app is localized to
>> languages A B and C, may treat UTF-16 as a fixed-length encoding if
>> these languages fit in it. If they don't, I'd consider moving to UTF-32.
>
> That can be used as an optimization.
> The container still should only support bidirectional traversal, for
> characters, that is. An Unicode string should also support a more
> efficient byte-like traversal.
>
> And anyway, you forgot the part about grapheme clusters.
> Since you don't seem to know what they are, even though I mentioned them
> several times, I will shortly explain that to you.

[snip]

Thank you for explaining this to me. I've heard and read about this
kind of character combining, but I have never had to support it in practice.

But as you noted yourself, Unicode has many precombined characters that
occupy a single code point. I don't know how many there are, but I
tend to think that they cover the majority of the commonly used
combining sequences. This, together with the fact that my example
application supports a limited set of languages, allows me to perform
the aforementioned optimizations.
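
To make the optimization concrete, here is a minimal sketch of the check I
have in mind. It assumes a C++11 compiler with std::u16string, and the names
is_single_unit_utf16 and code_point_at are made up for this example, not
taken from any library. A UTF-16 string can be indexed as fixed-length only
if it contains no surrogate code units:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if no code unit is a surrogate
// (0xD800..0xDFFF), i.e. every code point occupies exactly one unit.
bool is_single_unit_utf16(const std::u16string& s)
{
    for (char16_t u : s)
        if (u >= 0xD800 && u <= 0xDFFF)
            return false;
    return true;
}

// Constant-time code point access. Only meaningful if the string has
// passed is_single_unit_utf16(); otherwise it may return half of a pair.
char32_t code_point_at(const std::u16string& s, std::size_t i)
{
    assert(i < s.size());
    return static_cast<char32_t>(s[i]);
}
```

Note that this only covers code points. A sequence such as "e" followed by
U+0301 passes the check but is still two code points for one visible
character, so the sketch relies on the text using precombined forms.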

The CJK languages, of course, are a whole different story, and in order
to support them we need true Unicode processing. There's no argument
on my side about this.

>> Not always. I may get such an identifier from a text-based protocol
>> primitive, thus I can handle it as a text. This assumption may allow
>> more opportunities to various optimizations.
>
> Text in a given human language is not exactly the same as textual data
> which may not use any word.

Yes, but that doesn't prevent me from processing it as text, does it?

>> Error and warning descriptions that may come either from your
>> application or from the OS, some third-party API or language runtime.
>> Although, I may agree that these messages could be localized too, but to
>> my mind it's an overkill. Generally, I don't need std::bad_alloc::what()
>> returning Russian or Chinese description.
>
> Not localizing your error messages is probably the worst thing you can do.
> I'm pretty sure the user would be frustrated if he gets an error in a
> language he doesn't understand well.

Maybe. And maybe not, if the only person who sees these messages is an
experienced system administrator, and the messages are in English. Once
again, I was speaking of server-side applications. I understand,
though, that such cases may not be the common ones.

>> Because I have an impression that it may be done more efficiently and
>> with less expenses. I don't want to pay for what I don't need - IMHO,
>> the ground principle of C++.
>
> Well then you can't have an unified text processing facility for all
> languages, which is the point of Unicode.
>
>
>> What algorithms do you mean and why would they need duplication?
>
> The algorithms defined by the Unicode Standard, like the collation one,
> along with the many tables it requires to do its job.
>
> Those algorithms and tables are defined for Unicode, and it can be more
> or less difficult to adapt them to another character set.

As I noted to Jeremy, I think all locale-specific stuff should be
encapsulated in locales. The processing algorithms then remain
independent of encoding specifics.
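
As a rough illustration of what I mean, the standard std::collate facet
already works this way: the calling code asks the locale to compare two
strings and knows nothing about the rules behind it. The wrapper name
compare_with_locale is made up for this example.

```cpp
#include <locale>
#include <string>

// Compares a and b using whatever collation rules loc provides.
// Returns a negative value, zero, or a positive value.
// The caller does not depend on how the facet implements the ordering.
int compare_with_locale(const std::string& a,
                        const std::string& b,
                        const std::locale& loc)
{
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

With std::locale::classic() this is a plain lexicographic comparison. Which
other locales are available, and how they order strings, depends on the
platform.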

>> I expect
>> iterating over a string or operator[] complexity to rise significantly
>> once we assume that the underlying string has variable-length chars.
>
> Iterating over the "true" characters would be a ridiculously inefficient
> operation -- especially if wanting to keep the guarantee that modifying
> the value pointed by an iterator doesn't invalidate the others --, and
> should be clearly avoided.
> I don't think there is much code in high-level programming languages
> that iterate over the strings.

Text parsing is one such example, and it may be extremely
performance-critical.
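
A small sketch of the difference in cost, assuming valid UTF-8 input (the
function names are made up for this example). Counting characters has to
inspect every byte, whereas splitting on an ASCII delimiter can stay at the
byte level, because bytes below 0x80 never occur inside a multi-byte UTF-8
sequence:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Counts code points, assuming s is valid UTF-8: every byte that is not
// a continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// Splits s on an ASCII delimiter by scanning bytes only; no decoding of
// multi-byte sequences is needed for this kind of parsing.
std::vector<std::string> split_ascii(const std::string& s, char delim)
{
    assert(static_cast<unsigned char>(delim) < 0x80);
    std::vector<std::string> out;
    std::size_t start = 0;
    for (std::size_t i = 0; i <= s.size(); ++i) {
        if (i == s.size() || s[i] == delim) {
            out.push_back(s.substr(start, i - start));
            start = i + 1;
        }
    }
    return out;
}
```

A parser that needs the n-th character rather than the next delimiter does
not get this shortcut, and that is the case I am worried about.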

>>> What encoding translation are you talking about?
>> Let's assume my app works with a narrow text file stream. If the stream
>> is using Unicode internally, it has to translate between the file
>> encoding and its internal encoding every time I output or input something.
>> I don't think that's the way it should be. I'd rather have an
>> opportunity to choose the encoding I want to work with and have it
>> through the whole formatting/streaming/IO tool chain with no extra
>> overhead. That doesn't mean, though, that I wouldn't want some day to
>> perform encoding translations with the same tools.
>
> The stream shouldn't be using any text representation or facility, but
> only be a convenience to write stuff in an agnostic way.
> Of course, the text processing layer which IMO should be quite separate
> will probably work with Unicode, but you don't have to use it.
>
> You should be working with Unicode internally in your app anyway if you
> want to avoid translations, since most systems or toolkits require
> Unicode in some form in their interfaces.

I'm not sure about the word "most" in the context of "require". I'd rather
say "most allow Unicode". But that does not mean that all strings in C++
should be in Unicode and that I should always work in it. I just want to
have a choice, after all.
Additionally, there is plenty of already written code that does not use
Unicode. We can't just throw it away.

