
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2007-06-22 19:58:42


Andrey Semashev wrote:

> Technically, yes. But most of the widely used character sets fit into
> UTF-16. That means that I, having said that my app is localized to
> languages A B and C, may treat UTF-16 as a fixed-length encoding if
> these languages fit in it. If they don't, I'd consider moving to UTF-32.

That can be used as an optimization.
The container should still only support bidirectional traversal (for
characters, that is). A Unicode string should also support a more
efficient byte-like traversal.

And anyway, you forgot the part about grapheme clusters.
Since you don't seem to know what they are, even though I mentioned them
several times, I will briefly explain them to you.

In Unicode, it is possible to use combining characters to build
characters, and the resulting sequences may be equivalent to ready-made
precomposed characters (which may or may not exist).
For example, "dé" might be represented by the two code points 'd' (100)
and 'é' (233) -- that's e with an acute accent, in case you can't see
it. (In UTF-8, 'é' would be the two bytes [195, 169].)
It might also be represented by the three code points 'd' (100), 'e'
(101) and the combining acute accent (769) (the combining acute accent
being, in UTF-8, the two bytes [204, 129]).
The character 'é', described as a combining sequence of the 'e' code
point and the combining acute accent, is equivalent to the ready-made
'é'. That's a single character, and it shouldn't be split in the
middle, obviously, since that would alter the meaning of other
characters or potentially invalidate the string.
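
As a small self-contained illustration (my own example, reusing the byte
values above), the two encoded forms of "dé" are canonically equivalent as
text yet compare unequal as byte sequences:

#include <cassert>
#include <string>

int main()
{
    // Precomposed form: U+0064 U+00E9, i.e. UTF-8 bytes 0x64, 0xC3 0xA9.
    const std::string precomposed = "\x64\xC3\xA9";
    // Decomposed form: U+0064 U+0065 U+0301, i.e. UTF-8 bytes 0x64, 0x65, 0xCC 0x81.
    const std::string decomposed = "\x64\x65\xCC\x81";

    assert(precomposed.size() == 3);
    assert(decomposed.size() == 4);
    // Equivalent text, unequal bytes: code-unit or code-point comparison
    // alone is not enough for Unicode strings.
    assert(precomposed != decomposed);
}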

Some characters may actually use several combining code points; it's
not limited to one. Of course, there is a canonical ordering.
There are of course other uses than accents for such things. In Hangul
(Korean), syllables can be written by combining the letters (jamo) they
are made of (from what I understood).

As you can see, characters may span a variable number of code points.
Of course, processing such text can be simplified by maintaining the
strings in a canonical state, like Normalization Form C.
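
For instance, with a recent ICU (used here purely as an illustration; any
Unicode library with normalization support would do), the decomposed
sequence can be brought to NFC before further processing:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <cassert>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);

    // "dé" as d + e + combining acute accent (U+0301).
    icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("de\xCC\x81");
    icu::UnicodeString composed = nfc->normalize(decomposed, status);

    assert(U_SUCCESS(status));
    assert(decomposed.length() == 3); // d, e, U+0301
    assert(composed.length() == 2);   // d, precomposed U+00E9
}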

> Not always. I may get such an identifier from a text-based protocol
> primitive, thus I can handle it as a text. This assumption may allow
> more opportunities to various optimizations.

Text in a given human language is not exactly the same thing as textual
data, which may not contain any words at all.

> Error and warning descriptions that may come either from your
> application or from the OS, some third-party API or language runtime.
> Although, I may agree that these messages could be localized too, but to
> my mind it's an overkill. Generally, I don't need std::bad_alloc::what()
> returning Russian or Chinese description.

Not localizing your error messages is probably the worst thing you can do.
I'm pretty sure a user would be frustrated to get an error in a language
he doesn't understand well.

> Because I have an impression that it may be done more efficiently and
> with less expenses. I don't want to pay for what I don't need - IMHO,
> the ground principle of C++.

Well then you can't have a unified text processing facility for all
languages, which is the point of Unicode.

> What algorithms do you mean and why would they need duplication?

The algorithms defined by the Unicode Standard, like the collation
algorithm, along with the many tables they require to do their job.

Those algorithms and tables are defined for Unicode, and it can be more
or less difficult to adapt them to another character set.

> The collation is just an approach to perform string comparison and
> ordering.

It's not "an approach". It is "the approach".
This is what you need if you want to order strings (in a human way) or
match loosely. (case insensitive search or stuff like that)
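
As an illustration with ICU's implementation of the Unicode Collation
Algorithm (the locale and strength below are just example choices):

#include <unicode/coll.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale::getFrench(), status));

    // Primary strength ignores case and accent differences, which is the
    // kind of loose matching a case-insensitive search needs.
    coll->setStrength(icu::Collator::PRIMARY);

    UCollationResult r =
        coll->compare(icu::UnicodeString::fromUTF8("c\xC3\xB4te"),   // "côte"
                      icu::UnicodeString::fromUTF8("COTE"), status);

    std::cout << (r == UCOL_EQUAL ? "equal" : "different") << "\n"; // prints "equal"
    return U_SUCCESS(status) ? 0 : 1;
}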

> I don't see how it is related to efficiency questions I mentioned.
> Besides, comparison is not the only operation on strings.

String searching and comparison are probably the most frequently used
operations on strings.

> I expect
> iterating over a string or operator[] complexity to rise significantly
> once we assume that the underlying string has variable-length chars.

Iterating over the "true" characters would be a ridiculously inefficient
operation -- especially if wanting to keep the guarantee that modifying
the value pointed by an iterator doesn't invalidate the others --, and
should be clearly avoided.
I don't think there is much code in high-level programming languages
that iterate over the strings.
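
For what it's worth, here is a sketch of grapheme-cluster iteration with
ICU's BreakIterator (assuming a reasonably recent ICU), showing that the
number of "true" characters differs from the number of code units:

#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(),
                                                    status));

    // d + e + combining acute accent: 3 UTF-16 code units, 2 characters.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("de\xCC\x81");
    it->setText(s);

    int32_t count = 0;
    it->first();
    while (it->next() != icu::BreakIterator::DONE)
        ++count;

    std::cout << "code units: " << s.length()       // 3
              << ", characters: " << count << "\n"; // 2
    return U_SUCCESS(status) ? 0 : 1;
}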

>> What encoding translation are you talking about?
>
> Let's assume my app works with a narrow text file stream. If the stream
> is using Unicode internally, it has to translate between the file
> encoding and its internal encoding every time I output or input something.
> I don't think that's the way it should be. I'd rather have an
> opportunity to chose the encoding I want to work with and have it
> through the whole formatting/streaming/IO tool chain with no extra
> overhead. That doesn't mean, though, that I wouldn't want some day to
> perform encoding translations with the same tools.

The stream shouldn't be tied to any particular text representation or
facility; it should only be a convenience for writing data in an
agnostic way.
Of course, the text processing layer, which IMO should be quite
separate, will probably work with Unicode, but you don't have to use it.

You should be working with Unicode internally in your app anyway if you
want to avoid translations, since most systems or toolkits require
Unicode in some form in their interfaces.

