From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-06-23 12:11:58


Andrey Semashev <andysem_at_[hidden]> writes:

[snip]

> I may not support character combining from several code points if it is
> unused or uncommon in languages A, B and C. Moreover, many precombined
> characters exist in Unicode as a single code point.

They do; it may be largely for compatibility reasons, I think. I don't
think it is a very good idea to attempt to provide partial support for
Unicode by supporting only single-code-point grapheme clusters.
Furthermore, I don't see that it would be a huge gain.
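
For instance, the same user-perceived character "é" can be written
either as the single precomposed code point U+00E9 or as U+0065
followed by the combining accent U+0301. Both forms are one grapheme
cluster, but only the first fits a single-code-point model. A rough
C++ sketch of the difference at the byte level (hard-coded UTF-8
bytes, purely for illustration):

    #include <cstdio>
    #include <string>

    int main() {
        // Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
        // One code point, two UTF-8 bytes.
        std::string precomposed = "\xC3\xA9";

        // Decomposed: U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        // Two code points, three UTF-8 bytes, still one grapheme cluster.
        std::string decomposed = "e\xCC\x81";

        std::printf("precomposed: %u bytes\n", (unsigned) precomposed.size()); // 2
        std::printf("decomposed:  %u bytes\n", (unsigned) decomposed.size());  // 3
        return 0;
    }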

> [snip]

>>>> That will just require duplicating the tables and algorithms required to
>>>> process the text correctly.
>>
>>> What algorithms do you mean and why would they need duplication?
>>
>> Examples of such algorithms are string collation, comparison, line
>> breaking, word wrapping, and hyphenation.

> Why would these algorithms need duplication? If we have all
> locale-specific traits and tools, such as collation tables, character
> checking functions like isspace, isalnum, etc. along with new ones that
> might be needed for Unicode, encapsulated into locale classes, the
> essence of the algorithms should be independent from the text encoding.

Using standard data tables and a single algorithm that merely consults
the locale-specific tables, you can provide these algorithms for UTF-16
(and the other Unicode encodings) for essentially all locales; this is
what libraries like IBM ICU do. Providing them for additional
encodings, however, would require separate data tables and separate
implementations.
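
To illustrate, locale-sensitive collation with ICU looks roughly like
the sketch below; the comparison algorithm is the same for every
locale, and only the collation data selected by the Locale argument
changes. (This assumes a reasonably recent ICU4C; the French locale
and the example strings are just for illustration.)

    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <cstdio>

    int main() {
        UErrorCode status = U_ZERO_ERROR;

        // One algorithm; the Locale argument only selects the collation data.
        icu::Collator* coll =
            icu::Collator::createInstance(icu::Locale::getFrench(), status);
        if (U_FAILURE(status)) return 1;

        icu::UnicodeString a = icu::UnicodeString::fromUTF8("cot\xC3\xA9"); // "coté"
        icu::UnicodeString b = icu::UnicodeString::fromUTF8("c\xC3\xB4te"); // "côte"

        UCollationResult r = coll->compare(a, b, status);
        std::printf("%s\n", r == UCOL_LESS    ? "a < b" :
                            r == UCOL_GREATER ? "a > b" : "a == b");

        delete coll;
        return 0;
    }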

>>> Besides, comparison is not the only operation on strings. I expect the
>>> complexity of iterating over a string or of operator[] to rise
>>> significantly once we assume that the underlying string has
>>> variable-length chars.
>>
>> The complexity remains the same if operator[] indexes over encoded
>> units, or you are iterating over the encoded units. Clearly, if you
>> want an iterator that converts from the existing encoding, which might
>> be UTF-8 or UTF-16, to UTF-32, then there will be greater complexity.
>> As stated previously, however, it is not clear why this is likely to be
>> a frequently useful operation.

> What do you mean by encoded units?

By encoded units I mean e.g. a single byte with UTF-8, or a 16-bit
quantity with UTF-16, as opposed to a code point.
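
Concretely, one and the same code point can occupy a different number
of code units in each encoding form. A quick sketch (the UTF-16
strings use C++11 char16_t literals, purely for illustration):

    #include <cstdio>
    #include <string>

    int main() {
        // U+00E9 is a single code point:
        std::string    utf8_a  = "\xC3\xA9";   // two UTF-8 code units (bytes)
        std::u16string utf16_a = u"\u00E9";    // one UTF-16 code unit

        // U+1D11E MUSICAL SYMBOL G CLEF is also a single code point:
        std::string    utf8_b  = "\xF0\x9D\x84\x9E"; // four UTF-8 code units
        std::u16string utf16_b = u"\U0001D11E";      // two UTF-16 code units
                                                     // (a surrogate pair)

        std::printf("%u %u %u %u\n",
                    (unsigned) utf8_a.size(), (unsigned) utf16_a.size(),
                    (unsigned) utf8_b.size(), (unsigned) utf16_b.size()); // 2 1 4 2
        return 0;
    }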

> What I was saying is that if we have a UTF-8 encoded string that contains
> both Latin and national characters that encode to several octets, it
> becomes a non-trivial task to extract the i-th character (not octet) from
> the string. The same problem arises with iteration - the iterator has to
> analyze the character it points to in order to adjust its internal pointer
> to the beginning of the next character. The same thing will happen with
> true UTF-16 and UTF-32 support.
> As an example of the need for such functionality, it is widely used in
> various text parsers.

I'm still not sure I quite see it. I would think that the most common
case in parsing text is to read it in order from the beginning. I
suppose in some cases you might be parsing something where you know a
field is aligned to e.g. the 20th character, but such formats tend to
assume very simple encodings anyway, because they don't make much sense
if you are to support complicated accents and such.
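
That said, to make the cost model concrete: walking a UTF-8 string
code point by code point is a single forward pass that only inspects
each lead byte, whereas extracting the i-th character has to repeat
that scan from the start, so it is O(n) rather than O(1). A minimal
sketch (the helper name utf8_len is mine; it assumes well-formed UTF-8
and does no validation):

    #include <cstdio>
    #include <cstddef>
    #include <string>

    // Length in bytes of the UTF-8 sequence whose lead byte is b.
    static std::size_t utf8_len(unsigned char b) {
        if (b < 0x80)           return 1; // ASCII
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }

    int main() {
        std::string s = "a\xC3\xA9\xE2\x82\xAC"; // 'a', U+00E9, U+20AC (euro sign)

        // Forward iteration over code points: look at each lead byte and
        // skip the whole sequence.
        std::size_t count = 0;
        for (std::size_t i = 0; i < s.size();
             i += utf8_len(static_cast<unsigned char>(s[i])))
            ++count;

        std::printf("%u code points in %u bytes\n",
                    (unsigned) count, (unsigned) s.size()); // 3 in 6

        // Finding the i-th code point requires the same scan from the start,
        // because the byte offset of character i is not known in advance.
        return 0;
    }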

-- 
Jeremy Maitin-Shepard
