From: Andrey Semashev (andysem_at_[hidden])
Date: 2007-06-25 14:55:23


Mathias Gaunard wrote:
> Andrey Semashev wrote:
>
>> Text parsing is one of such examples. And it may be extremely
>> performance critical.
>
> Text parsing being quite low-level, they should probably use lower-level
> accesses (iterating over code points or code units for example).
>
> Extensive parsing should probably access lower-level views of the
> string, like code points or code units, and eventually be careful
> depending on what they do.

I agree that parsing is rather a low-level task. But I see no benefit
from being forced to parse Unicode code points instead of fixed-length
chars in a given encoding.
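
To make the cost difference concrete, here is a minimal sketch. The function names are made up for illustration, Latin-1 stands in for "a fixed-length encoding", and the UTF-8 path assumes well-formed input with no validation. It only shows that reaching the n-th character is a single index operation in one case and a walk from the start of the string in the other.

```cpp
#include <cstddef>
#include <string>

// Fixed-width single-byte encoding (e.g. ISO-8859-1): the n-th character
// is simply the n-th byte, so access is O(1).
inline unsigned char nth_char_fixed(const std::string& s, std::size_t n)
{
    return static_cast<unsigned char>(s[n]);
}

// UTF-8: a code point occupies 1 to 4 bytes, so reaching the n-th one
// means decoding lengths from the beginning. Assumes well-formed input
// and that the string holds more than n code points.
inline char32_t nth_code_point_utf8(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    for (;;) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (n == 0) {
            // Strip the length marker bits from the lead byte, then
            // append 6 payload bits from each continuation byte.
            char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            return cp;
        }
        i += len;
        --n;
    }
}
```

A parser that knows its input is in a fixed-width encoding can use the first form everywhere; one that is forced onto code points pays for the second.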

> Various Unicode related tools (text boundaries searching etc.) would be
> needed to assist the parser in this task.

That would be nice.

> Building a fully Unicode-aware regex engine is probably difficult.
> See the guidelines here: http://unicode.org/unicode/reports/tr18/
> Boost.Regex -- which makes use of ICU for Unicode support -- for
> example, does not even fully comply with level 1.

Interesting. I wonder what level of support will be proposed to the
Standardization Committee.

>>> You should be working with Unicode internally in your app anyway if you
>>> want to avoid translations, since most systems or toolkits require
>>> Unicode in some form in their interfaces.
>> I'm not sure about the "most" word in context of "require". I'd rather
>> say "most allow Unicode".
>
> I know of several libraries or APIs that only work with Unicode. It's
> simply easier for them if there is only one format that represents all text.
> GTK+ is one example.

Well, that doesn't mean I was wrong in my statement. :)

>> But that does not mean that all strings in C++
>> should be in Unicode and I should always work in it. I just want to have
>> a choice, after all.
>
>> Additionally, there is plenty of already written code that does not use
>> Unicode. We can't just throw it away.
>
> Compatibility with legacy code will always be an issue.
> Isn't a runtime conversion simply acceptable?

I don't think so - that brings us back to the performance issue.

I just don't understand why there is such a strong will to drop fixed-width
char encodings in favor of exclusive Unicode support. Why can't we have
both? Is it because the text processing algorithms would have to be
duplicated? I think not, if the implementation is well designed. Is it
because the CRT would grow with encoding-specific data? Possibly,
but not necessarily. In fact, if application size is the primary
concern, the whole of Unicode support is a good candidate for removal.
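
As a rough illustration of the "no duplication" point, an algorithm can be written once against a small encoding policy. Everything below is hypothetical: the policy names, the `next` interface and `count_chars` are invented for this sketch and are not taken from any existing or proposed library. The UTF-8 policy again assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding policies: each decodes one character starting at
// s[i] and advances i past it.
struct latin1_encoding {
    static char32_t next(const std::string& s, std::size_t& i)
    {
        return static_cast<unsigned char>(s[i++]);
    }
};

struct utf8_encoding {
    // Assumes well-formed UTF-8; no validation is attempted.
    static char32_t next(const std::string& s, std::size_t& i)
    {
        unsigned char lead = static_cast<unsigned char>(s[i++]);
        if (lead < 0x80) return lead;
        int extra = lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        char32_t cp = lead & (0x3F >> extra);
        while (extra-- > 0)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        return cp;
    }
};

// The algorithm itself is written once and instantiated per encoding.
template <typename Encoding>
std::size_t count_chars(const std::string& s, char32_t wanted)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); )
        if (Encoding::next(s, i) == wanted)
            ++count;
    return count;
}
```

With this shape, `count_chars<latin1_encoding>` compiles down to a plain byte loop, while `count_chars<utf8_encoding>` pays for decoding only where it is actually requested.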

