From: Andrey Semashev (andysem_at_[hidden])
Date: 2007-06-23 07:36:55


Jeremy Maitin-Shepard wrote:
> Andrey Semashev <andysem_at_[hidden]> writes:
>
>> Mathias Gaunard wrote:
>>> Andrey Semashev wrote:
>>>
>>>> I'd rather stick to UTF-16 if I had to use
>>>> Unicode.
>>> UTF-16 is a variable-length encoding too.
>>>
>>> But anyway, Unicode itself is a variable-length format, even with the
>>> UTF-32 encoding, simply because of grapheme clusters.
>
>> Technically, yes. But most of the widely used character sets fit into
>> UTF-16. That means that I, having said that my app is localized to
>> languages A B and C, may treat UTF-16 as a fixed-length encoding if
>> these languages fit in it. If they don't, I'd consider moving to
>> UTF-32.
>
> Note that even if you can represent a single Unicode code point in your
> underlying type for storing a single unit of encoded text, you still
> have the issue of combining characters and such. Thus, it is not clear
> how a fixed-width encoding makes text processing significantly easier;
> I'd be interested if you have some examples where it does make
> processing significantly easier.

I may choose not to support combining characters composed of several code
points if such combinations are unused or uncommon in languages A, B and C.
Moreover, many precomposed characters exist in Unicode as a single code
point.
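
For illustration only (this is my own sketch, not code from any library
under discussion, and it assumes a compiler that provides char32_t): the
same visible character "e with acute" can be stored either as one
precomposed code point or as a base letter plus a combining mark, which is
why even UTF-32 is not strictly "one code point == one character".

#include <cstddef>
#include <iostream>

int main()
{
    // U+00E9 LATIN SMALL LETTER E WITH ACUTE -- precomposed, one code point
    const char32_t precomposed[] = { 0x00E9 };

    // U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT -- two code points
    const char32_t combining[] = { 0x0065, 0x0301 };

    std::cout << sizeof(precomposed) / sizeof(precomposed[0]) << '\n'; // prints 1
    std::cout << sizeof(combining) / sizeof(combining[0]) << '\n';     // prints 2
}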

[snip]

>>> That will just require duplicating the tables and algorithms required to
>>> process the text correctly.
>
>> What algorithms do you mean and why would they need duplication?
>
> Examples of such algorithms are string collation, comparison, line
> breaking, word wrapping, and hyphenation.

Why would these algorithms need duplication? If all locale-specific traits
and tools, such as collation tables and character classification functions
like isspace, isalnum, etc., along with any new ones that might be needed
for Unicode, are encapsulated into locale classes, the essence of the
algorithms should be independent of the text encoding.
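
As a rough sketch of what I mean (my own illustration; count_words is a
made-up function, not an existing interface), an algorithm that obtains
character classification from a std::locale does not itself care which
encoding the character type uses:

#include <locale>
#include <string>
#include <cstddef>

template< typename CharT >
std::size_t count_words(const std::basic_string< CharT >& text, const std::locale& loc)
{
    std::size_t words = 0;
    bool in_word = false;
    for (typename std::basic_string< CharT >::const_iterator it = text.begin();
         it != text.end(); ++it)
    {
        // classification is deferred to the locale, not hard-coded
        const bool space = std::isspace(*it, loc);
        if (!space && !in_word)
            ++words;
        in_word = !space;
    }
    return words;
}

The same function body works for char and wchar_t strings; only the locale
(and, for Unicode, possibly richer facets) would differ.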

>> Besides, comparison is not the only operation on strings. I expect
>> iterating over a string or operator[] complexity to rise significantly
>> once we assume that the underlying string has variable-length chars.
>
> The complexity remains the same if operator[] indexes over encoded
> units, or you are iterating over the encoded units. Clearly, if you
> want an iterator that converts from the existing encoding, which might
> be UTF-8 or UTF-16, to UTF-32, then there will be greater complexity.
> As stated previously, however, it is not clear why this is likely to be
> a frequently useful operation.

What do you mean by encoded units?
What I was saying is that if we have a UTF-8 encoded string that contains
both Latin characters and national characters that encode to several
octets, it becomes a non-trivial task to extract the i-th character (not
octet) from the string. The same problem arises with iteration: the
iterator has to analyze the character it points to in order to advance its
internal pointer to the beginning of the next character. The same thing
will happen with true UTF-16 and UTF-32 support.
As an example of the need for such functionality, it is widely used in
various text parsers.
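
To make the cost concrete, here is a simplified sketch (mine, assuming
well-formed UTF-8 input and doing no error checking) of finding the octet
offset of the i-th character; every preceding lead byte has to be
inspected, so access is O(i) rather than O(1):

#include <string>
#include <cstddef>

// Number of octets in the UTF-8 sequence that starts with lead byte b.
inline std::size_t utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;           // 0xxxxxxx - ASCII
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    return 4;                         // 11110xxx
}

// Octet offset of the i-th code point in a UTF-8 encoded string.
std::size_t offset_of_code_point(const std::string& utf8, std::size_t i)
{
    std::size_t pos = 0;
    while (i > 0 && pos < utf8.size())
    {
        pos += utf8_sequence_length(static_cast< unsigned char >(utf8[pos]));
        --i;
    }
    return pos;
}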

[snip]

>>> What encoding translation are you talking about?
>
>> Let's assume my app works with a narrow text file stream.
>
> For simplicity, we can avoid using the "narrow"/"wide" terminology and
> say you have a text file encoded using a 1-byte fixed width encoding,
> like ASCII or iso-8859-1.

That was exactly what I meant by the term "narrow". :) But I'm happy to say
"1-byte fixed width encoding" instead, to reduce misunderstanding.

