From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-04-07 04:24:56

In article <c508gd$o78$1_at_[hidden]>, Vladimir Prus <ghost_at_[hidden]> wrote:

> > 1. You really sit down and write an appropriate Unicode string abstraction
> > for boost, not tied to program_options
> >
> > If this is the choice you make, then we should be discussing this
> > separately from program_options, and it should be designed separately;
> > when its implementation design begins, program_options can be the first
> > client, of course.
> I'd like to avoid this by all means. I'm not a unicode expert, but what I've
> learned already sounds complex enough. Besides, competing with existing
> solutions like ICU is not such a good idea.

I agree that this is not the best choice for you at this time, not because I
think boost should avoid competing with the ICU, but because I think that it's
beyond the scope of your current work.

> 1. When declaring each option, one should be able to specify whether the
> value should be parsed using unicode, or ascii. If it should be parsed
> using unicode, all unicode issues (e.g. normalization), are up to the
> client.

OK, I have to say at this point that I have not spent much time looking at the
PO design itself (nor do I have the time right now; it was only the mention of
Unicode that brought me out of lurking) so I may be confused about what's going
on here.

That said, the way I understand it is that you have some character-based input
(argv, config file, environment) that's passed to your library, which you then
need to parse for options based on a client-provided specification and set some
variables back in the client. The level of Unicode support you want is that you
want to accept Unicode characters on input (presumably the config file, as I am
not aware of a wchar_t argv variant), and do something sensible (i.e., not
mangle the values) from there.

I am going to guess that all the characters used as delimiters in the parsing
code are ASCII. If that is the case, you could simply continue to treat all
strings as containers of code points and you would not run into any problems
except when parsing a string that contains a delimiter character followed by a
combining mark; for example, foo="bar"<combining mark>baz" would be incorrectly
parsed as foo="bar".
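To make the failure concrete, here is a minimal C++ sketch of a scan that treats the string as a plain container of code points. All names are hypothetical and none of this is program_options code. It also assumes, as a simplification, that only the Combining Diacritical Marks block (U+0300..U+036F) counts as combining; real code would consult the Unicode character database.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F is treated as combining.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Naive scan: index of the first '"' at or after `from`, or npos.
// It inspects single code points and knows nothing about combining marks.
std::size_t find_quote_naive(const std::u32string& s, std::size_t from) {
    return s.find(U'"', from);
}

// True if the code point at `pos` is immediately followed by a combining
// mark, i.e. it is only the base of a longer user-perceived character.
bool followed_by_combining_mark(const std::u32string& s, std::size_t pos) {
    return pos + 1 < s.size() && is_combining_mark(s[pos + 1]);
}
```

Given the value text bar"<U+0301>baz", find_quote_naive stops at the first quote and yields "bar", while followed_by_combining_mark reports that this quote was not a bare delimiter.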

However, you already have to deal with the case of embedded delimiters, and
there is no reason why you can't extend whatever you are doing now to this case.
For example, if I have to write foo="bar\"baz" when the embedded " is not
followed by a combining mark, then I can just as well be required to write
foo="bar\"<combining mark>baz" when it is.
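A rough sketch of that idea in C++, assuming a backslash escapes the next code point (the actual escaping rules of program_options may differ, and the function names are made up for illustration):

```cpp
#include <cstddef>
#include <string>

// Index of the first unescaped '"' at or after `from`, or npos.
// A backslash escapes the code point that follows it.
std::size_t find_closing_quote(const std::u32string& s, std::size_t from) {
    for (std::size_t i = from; i < s.size(); ++i) {
        if (s[i] == U'\\') { ++i; continue; }  // skip the escaped code point
        if (s[i] == U'"') return i;
    }
    return std::u32string::npos;
}

// Removes one level of backslash escaping. A trailing lone backslash is
// kept as-is.
std::u32string unescape(const std::u32string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == U'\\' && i + 1 < s.size()) ++i;  // drop the backslash
        out.push_back(s[i]);
    }
    return out;
}
```

The scanner never needs to know what a combining mark is: the escaped quote is skipped like any other escaped code point, and the mark that follows it stays attached to it in the unescaped value.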

So, (and again, this is based on very little information about program_options
and mostly on a quick sketch of it that I formed in my head this evening), it
seems that as long as you have a mechanism right now to cope with embedding
delimiters in program options, you should be able to continue using essentially
the same mechanism to cope with Unicode strings. From there, you can decompose
your input into keys and values, and now you are left with parsing the values;
string values are converted according to the locale (as you said yourself) and
numeric values can probably be parsed by converting them to ASCII and then using
the method you already have.
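As an illustration of the last two steps, the following C++ sketch splits an option at the first '=' and parses a numeric value by narrowing it to ASCII first. The names split_option, to_ascii and parse_int are hypothetical, and rejecting non-ASCII digits outright is an assumption of this sketch rather than something settled in this thread.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Splits "key=value" at the first '='. Throws if there is no '='.
std::pair<std::u32string, std::u32string>
split_option(const std::u32string& line) {
    std::size_t eq = line.find(U'=');
    if (eq == std::u32string::npos)
        throw std::invalid_argument("missing '='");
    return std::make_pair(line.substr(0, eq), line.substr(eq + 1));
}

// Narrows to ASCII. Throws if any code point is above U+007F.
std::string to_ascii(const std::u32string& s) {
    std::string out;
    for (char32_t c : s) {
        if (c > 0x7F)
            throw std::invalid_argument("non-ASCII character in numeric value");
        out.push_back(static_cast<char>(c));
    }
    return out;
}

// Parses an integer with the ordinary narrow-character routine and
// rejects trailing characters that std::stoi would otherwise ignore.
int parse_int(const std::u32string& value) {
    std::string ascii = to_ascii(value);
    std::size_t used = 0;
    int n = std::stoi(ascii, &used);
    if (used != ascii.size())
        throw std::invalid_argument("trailing characters in numeric value");
    return n;
}
```

String values would skip to_ascii entirely and go through whatever locale conversion the library already performs.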

> It might be even simpler: since I only look for characters in ascii, so I
> even don't care about canonical form -- I believe all ascii characters are
> unambigous (I mean strict 7-bit ascii).

Yes, except in the case where they are followed by a combining mark; see remarks
above.

> For now, I don't plan to support Unicode in option names, so string
> comparison is not yet needed.

Oh good :-)

> Thanks for your comments!

You are welcome! I am admittedly somewhat tired right now; I hope I am making
myself clear enough. :-)

