Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Peter Dimov (pdimov_at_[hidden])
Date: 2011-01-17 22:09:16


Alexander Lamaison wrote:
> I don't understand how it could possibly not help. If I see an api
> function call_me(std::string arg) I know next to nothing about what it's
> expecting from the string (except that by convention it tends to mean
> 'string in OS-default encoding').

You should read the documentation of call_me (*). Yes, I know that in the
real world the documentation often doesn't specify an encoding (worse - the
encoding varies between platforms and even versions of the same library),
but if the developer of call_me hasn't bothered to document the encoding of
the argument, he won't bother to use a special UTF-8 type for the argument,
either. :-)

(*) And the documentation should either say that call_me accepts UTF-8, or
that call_me is encoding-agnostic, that is, it treats the string as a byte
sequence.

I can think of one reason to use a separate type - if you want to overload
on encoding:

    void f( latin1_t arg );
    void f( utf8_t arg );

In most such cases that spring to mind, however, what the user actually
wants is:

    void f( string arg, encoding_t enc );

or even

    void f( string arg, string encoding );
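To make the comparison concrete, here is a minimal sketch of how the overloads and the runtime-parameter forms could coexist. The wrapper structs, the enumerators of encoding_t, the encoding names "UTF-8" and "ISO-8859-1", and the string return value of f are all assumptions made for illustration; the original only names the signatures.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical encoding tag; only the name encoding_t comes from the post.
enum class encoding_t { latin1, utf8 };

// Hypothetical wrapper types that fix the encoding at compile time.
struct latin1_t { std::string bytes; };
struct utf8_t   { std::string bytes; };

// Runtime form: the encoding travels next to the bytes.
// Returns a tagged copy purely so the dispatch is observable.
std::string f(const std::string& arg, encoding_t enc) {
    const std::string tag = (enc == encoding_t::utf8) ? "utf8:" : "latin1:";
    return tag + arg;
}

// Map an encoding name (assumed spellings) to the tag; throws on anything else.
encoding_t parse_encoding(const std::string& name) {
    if (name == "UTF-8")      return encoding_t::utf8;
    if (name == "ISO-8859-1") return encoding_t::latin1;
    throw std::invalid_argument("unknown encoding: " + name);
}

// String-named form, e.g. for an encoding read from a header at runtime.
std::string f(const std::string& arg, const std::string& encoding) {
    return f(arg, parse_encoding(encoding));
}

// Compile-time forms: thin wrappers that forward to the runtime form.
std::string f(const latin1_t& arg) { return f(arg.bytes, encoding_t::latin1); }
std::string f(const utf8_t& arg)   { return f(arg.bytes, encoding_t::utf8); }
```

In this sketch the typed overloads add nothing the runtime form cannot do, while the runtime form also covers an encoding that is only known once the input arrives.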

In principle, as Chad Nelson says, it's useful to have separate types if the
program uses several different encodings at once, fixed at compile time. I
don't consider such a way of programming a good idea though. Strings should
be either byte sequences or UTF-8; input can be of any encoding, possibly
not known until runtime, but it should always be either processed as a byte
sequence or converted to UTF-8 as a first step.
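As an illustration of "converted to UTF-8 as a first step", here is a minimal sketch for one input encoding, ISO-8859-1 (Latin-1). The function name latin1_to_utf8 is hypothetical. The conversion works because each Latin-1 byte value equals its Unicode code point, so bytes below 0x80 are copied unchanged and the rest become two-byte UTF-8 sequences.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8 at the input boundary.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range: same single byte in UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

After this step the rest of the program only ever sees UTF-8, regardless of where the input came from.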

Regarding the OS-default encoding - if, on Windows, you ever encounter or
create a string in the OS default encoding (typically a legacy code page
that cannot represent every Unicode string), you've already lost - this
code can't be correct. :-)

