Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Emil Dotchevski (emil_at_[hidden])
Date: 2011-01-18 00:46:36


On Mon, Jan 17, 2011 at 7:33 PM, Dave Abrahams <dave_at_[hidden]> wrote:
> On Mon, Jan 17, 2011 at 10:09 PM, Peter Dimov <pdimov_at_[hidden]> wrote:
>> Alexander Lamaison wrote:
>>>
>>> I don't understand how it could possibly not help.  If I see an API
>>> function call_me(std::string arg), I know next to nothing about what it's
>>> expecting from the string (except that by convention it tends to mean
>>> 'string in OS-default encoding').
>>
>> You should read the documentation of call_me (*). Yes, I know that in the
>> real world the documentation often doesn't specify an encoding (worse - the
>> encoding varies between platforms and even versions of the same library),
>> but if the developer of call_me hasn't bothered to document the encoding of
>> the argument, he won't bother to use a special UTF-8 type for the argument,
>> either. :-)
>>
>> (*) And the documentation should either say that call_me accepts UTF-8, or
>> that call_me is encoding-agnostic, that is, it treats the string as a byte
>> sequence.
>>
>> I can think of one reason to use a separate type - if you want to overload
>> on encoding:
>>
>>   void f( latin1_t arg );
>>   void f( utf8_t arg );
>>
>> In most such cases that spring to mind, however, what the user actually
>> wants is:
>>
>>   void f( string arg, encoding_t enc );
>>
>> or even
>>
>>   void f( string arg, string encoding );
>>
>> In principle, as Chad Nelson says, it's useful to have separate types if the
>> program uses several different encodings at once, fixed at compile time. I
>> don't consider such a way of programming a good idea though. Strings should
>> be either byte sequences or UTF-8; input can be of any encoding, possibly
>> not known until runtime, but it should always be either processed as a byte
>> sequence or converted to UTF-8 as a first step.
>
> DISCLAIMER: I have almost no experience with the details of this
> stuff.  I only know a few general things about programming (fewer
> every day).
>
> I think the reason to use separate types is to provide a type-safety
> barrier between your functions that operate on utf-8 and system or
> 3rd-party interfaces that don't or may not.  In principle, that should
> force you to think about encoding and decoding at all the places where
> it may be needed, and should allow you to code naturally and with
> confidence where everybody is operating in utf8-land.  The typical
> failures I've seen, where there is no such mechanism (e.g. in Python
> where there's no static typing), are caused because programmers lose
> track of whether what they're handling is encoded as utf-8 or not.

UTF-8 allows the use of char * for type erasure for strings, much like
void * allows that in general. Using C++ type tags to discriminate
between the different kinds of data pointed to by void pointers is
mostly redundant, except when type safety is postponed until run-time;
and that's only marginally safer than using string tags.
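
To make the two positions in this thread concrete, here is a minimal
sketch, not taken from any Boost library. All names (tagged_string,
utf8_string, latin1_string, to_utf8, count_code_points, demo) are
hypothetical. The first part is the compile-time tag approach; the
second is the convention that a plain std::string always holds UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Approach 1 (hypothetical): compile-time encoding tags.
// The tag makes each encoding a distinct type, so mixing them up
// is a compile error rather than a run-time surprise.
struct utf8_tag {};
struct latin1_tag {};

template <class Tag>
class tagged_string {
public:
    explicit tagged_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

using utf8_string = tagged_string<utf8_tag>;
using latin1_string = tagged_string<latin1_tag>;

// Latin-1 bytes map directly to code points U+0000..U+00FF, so each
// byte becomes either one ASCII byte or a two-byte UTF-8 sequence.
inline utf8_string to_utf8(const latin1_string& in) {
    std::string out;
    for (unsigned char c : in.bytes()) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8_string(std::move(out));
}

// Approach 2: no tag at all. The function documents that its argument
// is UTF-8 and counts every byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Input is converted to UTF-8 as a first step; after that, everything
// works on plain bytes that are UTF-8 by convention.
inline void demo() {
    latin1_string input("caf\xE9");             // "café" in Latin-1, 4 bytes
    utf8_string converted = to_utf8(input);     // 5 bytes in UTF-8
    assert(converted.bytes().size() == 5);
    assert(count_code_points(converted.bytes()) == 4);
}
```

With the tags, passing a latin1_string where a utf8_string is expected
fails to compile. With the convention, the same mistake compiles, and
only the documentation of count_code_points says what the bytes mean.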

Emil Dotchevski
Reverge Studios, Inc.
http://revergestudios.com/reblog/index.php?n=ReCode.ReCode


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk