From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2007-06-24 18:47:02


Andrey Semashev wrote:

> Text parsing is one of such examples. And it may be extremely
> performance critical.

Text parsing is quite low-level, so a parser should probably use
lower-level access (iterating over code points or code units, for example).

Extensive parsing should probably work on lower-level views of the
string, such as code points or code units, and may need extra care
depending on what it does.
Various Unicode-related tools (text boundary searching, etc.) would be
needed to assist the parser in this task.
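
To illustrate the difference between the two views, here is a minimal
sketch, assuming the string is stored as UTF-8 in a std::string and a
C++11 compiler is available for char32_t. The helper name decode_utf8 is
hypothetical and not part of any proposed library. Iterating over the
std::string itself gives code units; the helper gives code points.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 code units into code points.
// Simplified sketch: truncated or malformed sequences become U+FFFD,
// but overlong encodings and surrogates are NOT rejected.
std::vector<char32_t> decode_utf8(const std::string& s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;

        // The lead byte tells us how many code units the sequence uses.
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        if (i + len > s.size()) {     // sequence cut off at end of input
            out.push_back(0xFFFD);
            break;
        }

        // Each continuation byte contributes its low 6 bits.
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A parser that only looks for ASCII delimiters can stay at the code unit
level, since in UTF-8 no byte of a multi-byte sequence is below 0x80. It
only needs the code point view when it inspects non-ASCII characters.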

Building a fully Unicode-aware regex engine is probably difficult.
See the guidelines here: http://unicode.org/unicode/reports/tr18/
Boost.Regex, for example, which uses ICU for its Unicode support, does
not even fully comply with Level 1.

>> You should be working with Unicode internally in your app anyway if you
>> want to avoid translations, since most systems or toolkits require
>> Unicode in some form in their interfaces.
>
> I'm not sure about the "most" word in context of "require". I'd rather
> say "most allow Unicode".

I know of several libraries and APIs that only work with Unicode. It's
simply easier for them if there is only one format that represents all text.
GTK+ is one example.

> But that does not mean that all strings in C++
> should be in Unicode and I should always work in it. I just want to have
> a choice, after all.

> Additionally, there is plenty of already written code that does not use
> Unicode. We can't just throw it away.

Compatibility with legacy code will always be an issue.
Wouldn't a runtime conversion simply be acceptable?
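
As a rough idea of what such a conversion could look like at the
boundary with legacy code, here is a minimal sketch. It assumes the
legacy strings are ISO-8859-1 (Latin-1), which is only an example
encoding; the helper name latin1_to_utf8 is hypothetical. Other legacy
encodings would need a lookup table or a library such as ICU or iconv.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Latin-1 byte values map directly to code points U+0000..U+00FF,
// so each input byte becomes either one or two UTF-8 code units.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            // ASCII range is identical in both encodings.
            out += static_cast<char>(c);
        } else {
            // U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

The cost is one linear pass per string, paid only where legacy code and
Unicode-based code meet.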

