From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-04-07 01:52:54


Hi Miro,

>> it seems that Unicode support is the last issue that should be addressed
>> before the library can be added to CVS. Since the issue is somewhat
>> tricky, I'd appreciate some comments before I start coding.
>
> Unicode is a non-trivial problem, and I strongly encourage you not to
> attempt to seriously tackle Unicode in program_options without spending
> some time thinking about more general issues of Unicode in boost and STL.
> As I see it, there are only two things that can come out of this:
>
> 1. You really sit down and write an appropriate Unicode string abstraction
> for boost, not tied to program_options
>
> If this is the choice you make, then we should be discussing this
> separately from program_options, and it should be designed separately;
> when its implementation design begins, program_options can be the first
> client, of course.

I'd like to avoid this by all means. I'm not a unicode expert, but what I've
learned already sounds complex enough. Besides, competing with existing
solutions like ICU (http://oss.software.ibm.com/icu/) is not such a good idea.

> 2. You don't really try to solve the whole problem, and you do the minimal
> amount of work needed for program_options to support Unicode, while
> ignoring the larger issue of Unicode support in applications
>
> In this case, you need to identify the minimal requirements you need to
> satisfy, and design program_options appropriately.

Right. Here are the requirements:

1. When declaring each option, one should be able to specify whether the
value should be parsed using unicode, or ascii. If it should be parsed
using unicode, all unicode issues (e.g. normalization), are up to the
client.

2. Each parser should have an ascii and a unicode version. How the unicode
string is obtained is up to the client.

3. The library guarantees that
- ascii input is passed to an ascii value without change
- unicode input is passed to a unicode value without change
- ascii input passed to a unicode value, and unicode input passed to an
ascii value, will be converted using a codecvt facet (which can be
specified by the user)

Essentially, the library will allow both ascii and unicode strings to pass
through unmodified.
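
A minimal sketch of the conversion case in requirement 3, assuming the
standard std::codecvt<wchar_t, char, std::mbstate_t> facet taken from a
std::locale. The helper name to_wide is hypothetical and is not part of
program_options; the default classic locale is also only an assumption, and
with it only plain ascii input can be expected to convert portably.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a program_options API): widen a narrow string
// with the codecvt facet of the given locale.
std::wstring to_wide(const std::string& in,
                     const std::locale& loc = std::locale::classic())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    // Assumption: one wide character per input byte is enough room,
    // which holds for the usual multibyte encodings.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r = cvt.in(
        state, in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    // The facet reported that no conversion is needed: copy char by char.
    if (r == std::codecvt_base::noconv)
        return std::wstring(in.begin(), in.end());
    // Treat both "error" and "partial" as failure in this sketch.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("to_wide: conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}

// Usage check: plain ascii survives the round into a wide string.
inline void to_wide_example()
{
    assert(to_wide("--count=3") == L"--count=3");
}
```

A user-specified facet would be supplied by passing a locale that carries
it; the pass-through cases need no such helper at all.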

> That said, remarks on your design:
>
> First of all, there is no guarantee that std::wstring is UCS4-encoded, nor
> even that std::wstring is wide enough to hold a UCS4 code point. Because
> of the extent to which wchar_t and std::wstring are platform-dependent, I
> would avoid looking at them at all. (They are so platform-dependent that
> you can't declare a wide character string literal and be assured that it
> will work on all reasonable compilers -- because you don't know how wide
> your characters are.)

Oh, I have to agree with this. Even though characters outside the BMP are
rare, it's better not to use wstring.
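
The width concern can be checked on a given platform with a small sketch
like the one below. The function name is hypothetical, and the check only
says whether wchar_t is wide enough for the largest code point (U+10FFFF);
it says nothing about which encoding the platform actually uses.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: true when a single wchar_t can represent every
// Unicode code point, i.e. its maximum value is at least 0x10FFFF.
inline bool wchar_holds_any_code_point()
{
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}

// Usage check, assuming 8-bit bytes: a wchar_t of 4 or more bytes is wide
// enough, a 2-byte wchar_t is not.
inline void wchar_width_example()
{
    assert(wchar_holds_any_code_point() == (sizeof(wchar_t) >= 4));
}
```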

> Given that, I would simply declare that the extent of Unicode support in
> program_options will be that it supports UTF-8-encoded std::strings, in
> either canonically decomposed form or canonically precomposed form. If you
> make those assertions, you can take advantage of Unicode properties in the
> following two ways:

I think there's no need to require that a std::string passed to
program_options be in UTF-8. It's better to allow the user to specify a
codecvt facet for converting char* into unicode. So, one can use the 8-bit
encoding specified by the locale, or use UTF-8, as one likes.
(Besides, codecvt uses wchar_t -- does that mean it can't really be used for
unicode either?)

> Searching for a substring X of string Y can be done without regard for
> character boundaries (because Unicode guarantees that characters are
> encoded to avoid false positives in this scenario).

It might be even simpler: since I only look for ascii characters, I don't
even care about the canonical form -- I believe all ascii characters are
unambiguous (I mean strict 7-bit ascii).
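
A minimal sketch of this point, assuming the input is UTF-8: every byte of
a multibyte UTF-8 sequence has its high bit set, so a byte-wise search for
a 7-bit ascii character such as '=' cannot match inside another character.
The helper name split_option is hypothetical and is not part of
program_options. The same shortcut is not safe for every legacy multibyte
encoding.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: split "name=value" at the first '=' by searching
// raw bytes, with no decoding of the UTF-8 content.
std::pair<std::string, std::string> split_option(const std::string& token)
{
    const std::string::size_type eq = token.find('=');
    if (eq == std::string::npos)
        return std::make_pair(token, std::string());
    return std::make_pair(token.substr(0, eq), token.substr(eq + 1));
}

// Usage check: the value is four two-byte UTF-8 characters and comes back
// byte for byte unchanged.
inline void split_option_example()
{
    const std::string value = "\xD0\x92\xD0\xBE\xD0\xB2\xD0\xB0";
    const std::pair<std::string, std::string> p =
        split_option("--name=" + value);
    assert(p.first == "--name");
    assert(p.second == value);
}
```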

> Strict string comparison can be done without regard for character
> boundaries (because every character has precisely one encoding in each
> canonical form).

For now, I don't plan to support Unicode in option names, so string
comparison is not yet needed.

> Basically, those two assumptions allow you to get as close to manipulating
> strings without considering character boundaries as you can, and IMNSHO
> that's the best you can do unless you want to design a real Unicode
> abstraction.

Thanks for your comments!

- Volodya