From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-04-06 09:06:42


Ferdinand Prantl wrote:

>> 1. Should 'unicode support' mean that there are two versions
>> of each interface, one using string and the other using
>> wstring? I think this kind of Unicode support is not good. It
>> means that each library which accepts or returns strings must
>> ultimately have double interface and be either entirely in
>> headers, or instantiate two variants in the sources -- which
>> doubles the size of the library.
>
> Not all libraries must have a doubled or templated interface for
> any basic_string<C, T, A>. Different libraries use different string types
> in their interfaces, and one must convert among them; a typical example is
> ANSI <-> UCS-2 on Windows. It is simply convenient to avoid such
> conversions when working with standard or widely used libraries, for
> example boost_regex, which provide a templated interface so one can choose
> an 8-bit or 16-bit character type (in the future probably 32-bit with a new
> basic_string<> ;-). I think program_options, being also such a widely used
> and quite small library, should also be templated (no offence, I mean just
> the size, not the importance ;-).

Making it templated would mean that using the library increases code size
for the client -- which I really want to avoid.

As for convenience -- consider the cases in the design document. One library
does not care about Unicode and exposes options_description<char>. The main
application uses Unicode, so it declares options_description<wchar_t>. Now, to
add the library's options to the program's options, some additional
conversion is needed.
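
To make the mismatch concrete, here is a minimal sketch. The names
basic_options_sketch, widen_ascii and merge_into are hypothetical and are not
part of program_options; the conversion assumes option names are 7-bit ASCII.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a templated options_description<CharT>.
template <class CharT>
struct basic_options_sketch {
    std::vector<std::basic_string<CharT> > names;
    void add(const std::basic_string<CharT>& name) { names.push_back(name); }
};

// The "additional conversion": narrow -> wide, assuming 7-bit ASCII input.
std::wstring widen_ascii(const std::string& s)
{
    std::wstring out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        assert(c < 0x80);  // assumption of this sketch: ASCII only
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}

// The application (wchar_t) cannot take the library's options (char) as-is;
// every name has to pass through the conversion first.
void merge_into(basic_options_sketch<wchar_t>& app,
                const basic_options_sketch<char>& lib)
{
    for (const std::string& name : lib.names)
        app.add(widen_ascii(name));
}
```

With a single non-templated interface, merge_into and the conversion step
would not be needed on the client side.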

The only advantage of making the library templated is that you don't need to
convert input into an internal encoding, so it might be faster. But is that
really important for a library which is not going to be a performance
bottleneck? (E.g. for boost::regex speed is much more important).

>> 2. Should program_options library use UTF-8 or wstring. As
>> I've said, neither is clear leader, but UTF-8 seems better.
>
> Here I disagree. Command-line shells work with all characters in the
> current locale (the whole 255 characters space of 8 bits is used). You
> would give the user a character array in UTF-8 encoding, which is not
> the typical case today; one processes the parameters via
> basic_string<char>(argv[x]) in the current locale.

I'm sorry but I'm lost. What does "you would give the user a character
array" mean?

> I think you should simply use basic_string<> as a template and leave the
> encoding to the caller, who either provides the specialization or performs
> the conversion himself. Or support the encoding internally by providing an
> interface to set it, rather than supporting one fixed encoding, even if I
> like UTF-8 because it supports the full Unicode character range, unlike
> UCS-2.
>
> Thinking more, you can expect rather short strings coming to
> program_options, not megabytes of text. For this usage it is more suitable
> to use fixed-size character encodings because they are faster and easier to
> work with, having direct support in basic_string<>.

Ok, I have to reiterate: the biggest advantage of UTF-8 is that the existing
command line parser will just work with UTF-8, so the "easier" point above
does not apply, IMO. As for speed: again I think it's of minor importance.
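
The reason a byte-oriented parser keeps working is that in UTF-8 every byte
of a multi-byte sequence has its high bit set, so ASCII delimiters such as
'-' and '=' can never appear inside an encoded character. A minimal sketch
follows; split_long_option is a hypothetical helper, not the actual
program_options parser.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: splits "--name=value" into (name, value) by scanning
// bytes only. It knows nothing about UTF-8, yet it is safe on UTF-8 input
// because '=' (0x3D) cannot occur inside a multi-byte sequence.
std::pair<std::string, std::string> split_long_option(const std::string& arg)
{
    assert(arg.size() >= 2 && arg[0] == '-' && arg[1] == '-');
    const std::string::size_type eq = arg.find('=', 2);
    if (eq == std::string::npos)
        return std::make_pair(arg.substr(2), std::string());
    return std::make_pair(arg.substr(2, eq - 2), arg.substr(eq + 1));
}
```

Passing "--name=" followed by the UTF-8 bytes of a Cyrillic word returns the
name "name" and the value bytes untouched; decoding the value is left to
whoever consumes it.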

>> That's all, and given that there are at least two UTF-8 codecs
>> announced on the mailing list, not a lot of work. And this
>> will add Unicode support without changing the interface a bit.
>
> Yes, you are right; there is not much work to add the conversion code into
> the internals of program_options. I also wrote my own UCS-2 <-> UTF-8
> encoding routines to use them for basic_string<char> <->
> basic_string<wchar_t> conversion. However, I think that we should reuse
> as much as possible and not rewrite similar code in every library
> that works with strings coming from real user input.

Heh, I'm not going to rewrite anything -- I'll use one of the facets that
are available.

> Your solution to support UTF-8 invisibly changes the interface anyway -
> not the text of prototypes directly but the behavior of the interface
> (encoding of strings).

The encoding of user-visible strings is not changed. The only user-visible
difference I see is that by default, char* input will be required to be
7-bit. But I think even this is not strictly required.

> Nevertheless, you could support this encoding conversion not only by
> providing your own conversion routines, but rather by accepting existing
> facets, which help streams similarly (as I wrote in the former e-mail).
> Then one could simply write a conversion facet once and use it for a
> stream input and also for a cmdline input, sharing the implementation.

That's for sure. I plan to use facets as much as possible.
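
As a rough illustration of what reusing a facet could look like, the sketch
below converts a narrow string to a wide one through whatever
std::codecvt<wchar_t, char, std::mbstate_t> facet the given locale carries.
The function to_wide is hypothetical and is not the program_options
implementation; a UTF-8 facet would have to be installed in the locale by
the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow -> wide using the codecvt facet found in 'loc'.
// The same facet can be imbued into a stream, so one implementation serves
// both stream input and command line input.
std::wstring to_wide(const std::string& in, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    // Assumption of this sketch: the output never needs more wide
    // characters than there are input bytes.
    std::wstring out(in.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);

    // Anything other than 'ok' (error, partial, noconv) is treated as a
    // failure here to keep the sketch short.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("to_wide: conversion failed");

    assert(from_next == in.data() + in.size());  // all input was consumed
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

With std::locale::classic() this only handles plain ASCII input; passing a
locale that carries a UTF-8 facet changes the decoding without touching the
function.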

- Volodya

