Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-17 22:33:41


On Mon, Jan 17, 2011 at 10:09 PM, Peter Dimov <pdimov_at_[hidden]> wrote:
> Alexander Lamaison wrote:
>>
>> I don't understand how it could possibly not help.  If I see an api
>> function call_me(std::string arg) I know next to nothing about what it's
>> expecting from the string (except that by convention it tends to mean
>> 'string in OS-default encoding').
>
> You should read the documentation of call_me (*). Yes, I know that in the
> real world the documentation often doesn't specify an encoding (worse - the
> encoding varies between platforms and even versions of the same library),
> but if the developer of call_me hasn't bothered to document the encoding of
> the argument, he won't bother to use a special UTF-8 type for the argument,
> either. :-)
>
> (*) And the documentation should either say that call_me accepts UTF-8, or
> that call_me is encoding-agnostic, that is, it treats the string as a byte
> sequence.
>
> I can think of one reason to use a separate type - if you want to overload
> on encoding:
>
>   void f( latin1_t arg );
>   void f( utf8_t arg );
>
> In most such cases that spring to mind, however, what the user actually
> wants is:
>
>   void f( string arg, encoding_t enc );
>
> or even
>
>   void f( string arg, string encoding );
>
> In principle, as Chad Nelson says, it's useful to have separate types if the
> program uses several different encodings at once, fixed at compile time. I
> don't consider such a way of programming a good idea though. Strings should
> be either byte sequences or UTF-8; input can be of any encoding, possibly
> not known until runtime, but it should always be either processed as a byte
> sequence or converted to UTF-8 as a first step.

DISCLAIMER: I have almost no experience with the details of this
stuff. I only know a few general things about programming (fewer
every day).

I think the reason to use separate types is to provide a type-safety
barrier between your functions that operate on UTF-8 and system or
third-party interfaces that don't or may not. In principle, that should
force you to think about encoding and decoding at all the places where
it may be needed, and should allow you to code naturally and with
confidence where everybody is operating in UTF-8-land. The typical
failures I've seen where there is no such mechanism (e.g. in Python,
where there's no static typing) happen because programmers lose
track of whether what they're handling is encoded as UTF-8 or not.
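
To make the idea concrete, here is a rough sketch of what such a barrier
might look like. Everything in it is an assumption for illustration: the
utf8_string class, its from_bytes and from_latin1 factories, and the
count_code_points function are hypothetical names, not part of Boost or
the standard library. The only way to get a utf8_string is through a
factory that validates or converts, so a function taking utf8_string can
rely on its input being well-formed UTF-8.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Returns true if s is well-formed UTF-8 (rejects overlong forms,
// surrogates, and code points above U+10FFFF).
inline bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;           // stray continuation or invalid lead byte
        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;
        i += len;
    }
    return true;
}

// Hypothetical wrapper type: holding one is proof the bytes were checked.
class utf8_string {
public:
    // Boundary for input that claims to be UTF-8 already: validate it.
    static utf8_string from_bytes(std::string bytes) {
        if (!is_valid_utf8(bytes))
            throw std::invalid_argument("not well-formed UTF-8");
        return utf8_string(std::move(bytes));
    }

    // Boundary for Latin-1 input: every byte maps to U+0000..U+00FF.
    static utf8_string from_latin1(const std::string& latin1) {
        std::string out;
        for (std::size_t i = 0; i < latin1.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(latin1[i]);
            if (c < 0x80) {
                out.push_back(static_cast<char>(c));
            } else {
                out.push_back(static_cast<char>(0xC0 | (c >> 6)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            }
        }
        return utf8_string(std::move(out));
    }

    // Explicit way back out, for interfaces that want raw bytes.
    const std::string& bytes() const { return data_; }

private:
    // Private, so a plain std::string can never convert silently.
    explicit utf8_string(std::string b) : data_(std::move(b)) {}
    std::string data_;
};

// A function in "UTF-8-land": it can trust its argument's encoding.
inline std::size_t count_code_points(const utf8_string& s) {
    std::size_t count = 0;
    const std::string& b = s.bytes();
    for (std::size_t i = 0; i < b.size(); ++i)
        if ((static_cast<unsigned char>(b[i]) & 0xC0) != 0x80) ++count;
    return count;
}
```

With this in place, count_code_points(some_std_string) does not compile;
the caller has to write utf8_string::from_bytes(...) or
utf8_string::from_latin1(...) first, which is exactly the point where
the encoding question gets asked.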

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
