Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-08-16 13:05:27


On Sat, Aug 13, 2011 at 23:24, Robert Ramey <ramey_at_[hidden]> wrote:

> Dave Abrahams wrote:
>
> >> std::string represents a sequence of "char" objects that happens
> >> to be useful for text processing. It can represent a text in any
> >> encoding.
> >>
> >> The question is how we treat this sequence... And this is a
> >> matter of policy and requirements of the library.
> >
> > I think I agree with Artyom here. *Somebody* has to decide how that
> > datatype will be interpreted when we receive it. Unless we refuse
> > altogether to accept std::string in our interfaces (which sounds like
> > a
> > bad idea to me), why not make the decision that it's UTF-8?
>
> hmmm - why can't we just leave it at "std::string represents a sequence of
> "char""
>

Because we are talking here about what 'a sequence of char' means, and you
*must* define it somehow.

> and define some derivative class which defines it as a
> "a refinement of std::string which supports UTF-8 functionality" ?
>

Even when wrapping it you must still define the conversions from 'sequences
of chars'. Here we come to the original problem.

On Mon, Aug 15, 2011 at 16:19, Stewart, Robert <Robert.Stewart_at_[hidden]> wrote:

> [...]
> As soon as the client did a cast, the client made the claim that
> non_utf_string met the requirements of the text class' constructor. The
> problem is that of the client misusing the class by an ill-advised cast.
> What's more, I think Soares indicated a debug-build validation that the
> argument indeed was UTF-8.
>
> I don't see a problem in that design, once the constructor is explicit.
>

I don't want to do any explicit casts. I want UTF-8 by default, at least as
an optional feature for me and others who think like me. I can afford the
risk of writing wrong code, which is really small if you know what you're
doing. And I'm saying this as a maintainer of a ~1 MLOC codebase which uses
this convention on *Windows*.

Regarding UTF-8 validation, it's not bullet-proof. Many non-UTF-8 sequences
may pass the validation. 8-bit encodings that don't coincide with ASCII are
even more likely to result in false positives.
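
To make the false-positive point concrete, here is a minimal sketch of a
strict validator. The names is_valid_utf8 and demo_false_positive are my own
illustration, not part of any proposed library. The checker rejects overlong
forms, surrogates and truncated sequences, yet it still accepts the
Windows-1252 bytes C2 A9, which were meant as the two characters "Â©".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }           // plain ASCII byte

        std::size_t len = 0;
        unsigned long cp = 0;
        if (c >= 0xC2 && c <= 0xDF)      { len = 2; cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
        else return false;                         // stray continuation or bad lead byte

        if (i + len > n) return false;             // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;        // overlong 3-byte form
        if (len == 4 && cp < 0x10000) return false;      // overlong 4-byte form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical demonstration of a false positive and a true negative.
void demo_false_positive() {
    // Windows-1252 "Â©" is also the valid UTF-8 encoding of U+00A9.
    assert(is_valid_utf8("\xC2\xA9"));
    // Windows-1252 "café" ends in a lone 0xE9 and is correctly rejected.
    assert(!is_valid_utf8("caf\xE9"));
}
```

So validation catches some legacy-encoded input, but passing it proves
nothing about what the sender actually meant.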

> > > Besides it does not harm you in any way
> >
> > It does. I already use UTF-8 for all my strings, even on
> > windows, and I don't want the code-bloat of all these
> > conversions (even if they're no-ops).
>
> What code bloat do you get from NOPs? Sure, there is more compilation time
> for the compiler to parse the text code and then for the optimizer to
> streamline it into a NOP, but even that is very likely negligible.
>

I'm talking about source-code bloat. About the boilerplate code I have to
write even if I already use UTF-8 everywhere:

std::string str = some_utf_8_string;
boost::utf8_function(text(str)); // Yes, I like UTF-8
boost2::utf8_function(str); // but I like it more when it's the default.
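
Spelled out in full, the difference might look like the sketch below. All of
it is hypothetical: the text class, the lib_explicit and lib_default
namespaces and the code-point counting body are stand-ins I made up for
illustration, not the API of Boost.Ustr or any other library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical wrapper that tags a byte sequence as UTF-8.
// The constructor is explicit, so std::string never converts silently.
class text {
public:
    explicit text(const std::string& utf8) : bytes_(utf8) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

// Shared helper: counts code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++count;
    }
    return count;
}

namespace lib_explicit {
// The caller must write text(str) at every call site.
std::size_t utf8_function(const text& t) { return count_code_points(t.bytes()); }
}

namespace lib_default {
// std::string is taken to be UTF-8 by convention; no wrapper is needed.
std::size_t utf8_function(const std::string& s) { return count_code_points(s); }
}

// Hypothetical call sites showing the boilerplate difference.
void demo_call_sites() {
    const std::string str = "gr\xC3\xBC\xC3\x9F";  // "grüß" encoded as UTF-8

    // lib_explicit::utf8_function(str);           // would not compile: explicit ctor
    assert(lib_explicit::utf8_function(text(str)) == 4);
    assert(lib_default::utf8_function(str) == 4);
}
```

Both calls do the same work; the only difference is the text(...) I have to
type at every call site in the first case.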

-- 
Yakov
