From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-04-06 14:28:38


In article <200404061127.44141.ghost_at_[hidden]>,
 Vladimir Prus <ghost_at_[hidden]> wrote:

> it seems that Unicode support is the last issue that should be addressed
> before the library can be added to CVS. Since the issue is somewhat tricky,
> I'd appreciate some comments before I start coding.

Unicode is a non-trivial problem, and I strongly encourage you not to attempt to
seriously tackle Unicode in program_options without spending some time thinking
about more general issues of Unicode in boost and STL. As I see it, there are
only two things that can come out of this:

1. You really sit down and write an appropriate Unicode string abstraction for
boost, not tied to program_options

If this is the choice you make, then we should be discussing this separately
from program_options, and it should be designed separately; once its
implementation is under way, program_options can be the first client, of course.

2. You don't really try to solve the whole problem, and you do the minimal
amount of work needed for program_options to support Unicode, while ignoring the
larger issue of Unicode support in applications

In this case, you need to identify the minimal requirements you need to satisfy,
and design program_options appropriately.

I very strongly discourage you from doing anything in between, because Unicode
becomes rather complex very quickly once you decide to do something
non-trivial, and most likely attempting to do something between 1 and 2 will
take you down path 1 before you know it. The complexity of Unicode and
internationalization in general should not be underestimated.

That said, remarks on your design:

First of all, there is no guarantee that std::wstring is UCS4-encoded, nor even
that std::wstring is wide enough to hold a UCS4 code point. Because of the
extent to which wchar_t and std::wstring are platform-dependent, I would avoid
looking at them at all. (They are so platform-dependent that you can't declare a
wide character string literal and be assured that it will work on all reasonable
compilers -- because you don't know how wide your characters are.)
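
To make the wchar_t point concrete, here is a minimal sketch of a compile-time
check for whether a single wchar_t can hold every Unicode code point on the
current platform. The function name is made up for illustration; it is not part
of program_options or of any proposed interface.

```cpp
#include <limits>

// Hypothetical helper (illustration only): true if one wchar_t can represent
// every Unicode code point, i.e. every value up to U+10FFFF.
// With a 16-bit wchar_t this is false; with a 32-bit wchar_t it is true.
constexpr bool wchar_can_hold_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}
```

Code that needs one code point per wchar_t could guard itself with
static_assert(wchar_can_hold_any_code_point(), "..."), which only moves the
portability problem to compile time rather than solving it.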

Given that, I would simply declare that the extent of Unicode support in
program_options will be that it supports UTF-8-encoded std::strings, in either
canonically decomposed form or canonically precomposed form. If you make those
assertions, you can take advantage of Unicode properties in the following two
ways:

Searching for a substring X in string Y can be done without regard for character
boundaries (because UTF-8 is designed so that one valid encoded string can only
match inside another on character boundaries, which avoids false positives in
this scenario).

Strict string comparison can be done without regard for character boundaries
(because every character has precisely one encoding in each canonical form).
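
As a rough sketch of what these two properties buy you, assuming both strings
are valid UTF-8 and in the same canonical form, plain byte-wise std::string
operations are enough. The helper names below are hypothetical and only
illustrate the idea; they are not a proposed program_options interface.

```cpp
#include <string>

// Hypothetical helper: substring search on UTF-8 data, done purely on bytes.
// Assumes both arguments are valid UTF-8 in the same canonical form.
bool utf8_contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Hypothetical helper: strict equality on UTF-8 data, done purely on bytes.
// Only meaningful when both strings use the same canonical form.
bool utf8_equal(const std::string& a, const std::string& b) {
    return a == b;
}
```

For example, utf8_contains("caf\xC3\xA9 au lait", "\xC3\xA9") finds the
precomposed e-acute, while utf8_equal("\xC3\xA9", "e\xCC\x81") is false because
the second string is the decomposed spelling of the same character. That is why
the canonical form has to be fixed up front.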

Basically, those two assumptions allow you to get as close to manipulating
strings without considering character boundaries as you can, and IMNSHO that's
the best you can do unless you want to design a real Unicode abstraction.

meeroh

-- 
If this message helped you, consider buying an item
from my wish list: <http://web.meeroh.org/wishlist>
