From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-06-23 16:21:36


Andrey Semashev <andysem_at_[hidden]> writes:

> Jeremy Maitin-Shepard wrote:
>> Andrey Semashev <andysem_at_[hidden]> writes:

>>>>>> That will just require duplicating the tables and algorithms required to
>>>>>> process the text correctly.
>>>>> What algorithms do you mean and why would they need duplication?
>>>> Examples of such algorithms are string collation, comparison, line
>>>> breaking, word wrapping, and hyphenation.
>>
>>> Why would these algorithms need duplication? If we have all
>>> locale-specific traits and tools, such as collation tables, character
>>> checking functions like isspace, isalnum, etc. along with new ones that
>>> might be needed for Unicode, encapsulated into locale classes, the
>>> essence of the algorithms should be independent of the text encoding.
>>
>> Using standard data tables, and a single algorithm that merely accesses
>> the locale-specific data tables, you can provide these algorithms for
>> UTF-16 (and other Unicode encodings) for essentially all locales. This
>> is done by libraries like IBM ICU. Providing them in addition for other
>> encodings, however, would require separate data tables and separate
>> implementations.

> I still can't see why one would need to reimplement algorithms. Their
> logic is the same regardless of the encoding.

I'll admit I haven't looked closely at the collation algorithm given by
the Unicode specifications recently, so it is hard for me to give
details. String collation is in general lexicographic over grapheme
clusters, but some languages have exceptions (someone please correct me
if I am mistaken). Perhaps someone with more knowledge can elaborate, but
I believe the Unicode collation algorithms are indeed highly specific to
Unicode.
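
For reference, locale-driven collation in C++ is usually exposed through
the std::collate facet; the data tables live behind the locale object and
the calling code stays the same. A minimal sketch (the locale name is
only an example of what might be installed, and nothing here says how the
facet is implemented underneath):

#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Example locale name; availability depends on the platform.
    std::locale loc("en_US.UTF-8");
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);

    std::string a = "apple", b = "Banana";
    // compare() orders according to the locale's collation rules,
    // which need not coincide with plain byte-wise comparison.
    int r = coll.compare(a.data(), a.data() + a.size(),
                         b.data(), b.data() + b.size());
    std::cout << (r < 0 ? "a < b" : r > 0 ? "a > b" : "a == b")
              << std::endl;
}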

>>> What I was saying is that if we have a UTF-8 encoded string containing
>>> both Latin characters and national characters that encode to several
>>> octets, extracting the i-th character (not octet) from the string
>>> becomes a non-trivial task. The same problem arises with iteration -
>>> the iterator has to analyze the character it points to in order to
>>> advance its internal pointer to the beginning of the next character.
>>> The same thing will happen with true UTF-16 and UTF-32 support.
>>> As an example of the need for such functionality, it is widely used
>>> in various text parsers.
>>
>> I'm still not sure I quite see it. I would think that the most common
>> case in parsing text is to read it in order from the beginning. I
>> suppose in some cases you might be parsing something where you know a
>> field is aligned to e.g. the 20th character,

> There may be different parsing techniques, depending on the text format.
> Sometimes character iteration alone is sufficient, as in forward
> sequential parsing. Nothing prevents non-sequential parsing, though (for
> example, when there is a table of contents with offsets, or when each
> field to be parsed is prefixed with its length).

Such a format would then likely not really be text, since it would
contain embedded offsets (which themselves would likely not be text). But
in any case, the offsets could simply be given as byte offsets (or
encoded unit offsets) rather than character or grapheme cluster offsets,
and then there is no problem.

Note: I'm using the term "encoded unit" because I can't recall the
proper term.
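
To make the offsets point concrete: in UTF-8 the lead octet of each
encoded character tells how many octets follow, so finding the n-th
character requires a linear scan, while a byte (or encoded unit) offset
can be used directly. A rough sketch, assuming well-formed UTF-8 and that
i points at the start of a character (the helper names are mine):

#include <cstddef>
#include <string>

// Return the index just past the encoded character starting at i.
std::size_t next_char(const std::string& s, std::size_t i)
{
    unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80) return i + 1;      // single octet (ASCII)
    if (lead < 0xE0) return i + 2;      // two-octet sequence
    if (lead < 0xF0) return i + 3;      // three-octet sequence
    return i + 4;                       // four-octet sequence
}

// Finding the n-th character is O(n); a byte offset is O(1).
std::size_t nth_char_offset(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    while (n-- > 0 && i < s.size())
        i = next_char(s, i);
    return i;
}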

>> but such formats tend to
>> assume very simple encodings anyway, because they don't make much sense
>> if you are to support complicated accents and such.

> If all standard algorithms and classes assume that the text being parsed
> is in Unicode, they cannot perform more efficient encoding-specific
> optimizations. The std::string or regex or stream classes will always
> have to treat the text as Unicode.

Well, since std::string and boost::regex already exist and do not assume
Unicode (or even necessarily support it very well; I've seen some
references to boost::regex providing Unicode support, but I haven't
looked into it), that is not likely to occur.

I think it is certainly important to provide some support for
non-Unicode encodings. In particular, converting between arbitrary
encodings should be supported. Depending on the extent to which your
parsing/processing relies on library text processing facilities beyond
basic encoding conversion, it may or may not be feasible to directly
process non-Unicode text if only this very basic level of support is
provided.
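
As a rough sketch of what that very basic level might look like, here is
a conversion helper built on POSIX iconv (the wrapper name is mine, and
error handling is abbreviated; a real version would loop on E2BIG and
handle partial conversions):

#include <iconv.h>
#include <stdexcept>
#include <string>

std::string convert(const std::string& in,
                    const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported conversion");

    std::string out(in.size() * 4 + 4, '\0');   // generous output buffer
    char* inbuf = const_cast<char*>(in.data()); // iconv wants char**
    size_t inleft = in.size();
    char* outbuf = &out[0];
    size_t outleft = out.size();

    size_t res = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (res == (size_t)-1)
        throw std::runtime_error("conversion failed");

    out.resize(out.size() - outleft);
    return out;
}

// e.g. convert(koi8_text, "KOI8-R", "UTF-8");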

It would be useful to explore how much trouble it is to support
arbitrary non-Unicode encodings, and also how useful it is to be able to
format/parse numbers, dates, and perhaps currencies in non-Unicode
encodings.
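
For the formatting side, the standard library already ties number
formatting to a locale through imbue(); whether that mechanism is
adequate for non-Unicode encodings is part of what would need exploring.
A small sketch (the locale name is only an example and may not be
installed on a given system):

#include <iostream>
#include <locale>
#include <sstream>

int main()
{
    std::ostringstream os;
    os.imbue(std::locale("de_DE"));   // may throw if not installed
    os << 1234567.89;  // grouping and decimal separator follow the locale
    std::cout << os.str() << std::endl;
}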

-- 
Jeremy Maitin-Shepard
