Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-18 12:54:53


On 01/18/2011 03:27 AM, Peter Dimov wrote:
> Dave Abrahams wrote:
>
>> I think the reason to use separate types is to provide a type-safety
>> barrier between your functions that operate on utf-8 and system or
>> 3rd-party interfaces that don't or may not. In principle, that should
>> force you to think about encoding and decoding at all the places where
>> it may be needed, and should allow you to code naturally and with
>> confidence where everybody is operating in utf8-land.
>
> Yes, in principle. It isn't terribly necessary if everybody is
> operating in UTF-8 land though. It's a bit like defining a separate
> integer type for nonnegative ints for type safety reasons - useful in
> theory, but nobody does it.
Are you saying that no one uses unsigned int for non-negative ints?
I think I must be misunderstanding you. I work with whole groups of
people who are careful to declare things to match their use, to take
advantage of the compiler diagnostics. Show me any large body of code
where people are sloppy about this, and I'll turn on the appropriate
warnings and find bugs for you by inspection. My experience is that
declaring everything int is something beginners do, but once they've
been bitten by the inevitable subtle and not-so-subtle bugs,
intermediate-level programmers learn to declare as unsigned anything
that will always be non-negative and for which a negative value would
be a mistake. In spite of being a good programmer with years of
experience, I make a constant series of sloppy coding errors and am
thankful for every category the compiler will tell me about. Everyone
who has ever worked at a place that builds with warnings turned up and
wants the warnings gone has gone through this and learned these
lessons. That's why I think I'm probably misunderstanding you.
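
As a minimal sketch of the kind of diagnostic meant here: the helper
below is hypothetical (count_matches is not from any library), and
which warnings actually fire depends on the compiler and the flags in
use. It declares a count and an index as unsigned because neither can
ever be negative.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how many elements equal `wanted`.
// A count can never be negative, so it is declared std::size_t (unsigned).
std::size_t count_matches(const std::vector<int>& values, int wanted) {
    std::size_t count = 0;
    // Writing `int i` here would compare a signed index against the
    // unsigned values.size(); GCC and Clang report that under -Wsign-compare.
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] == wanted) {
            ++count;
        }
    }
    return count;
}
```

Matching the declared type to the intended use is what gives the
compiler something to check; with everything declared int, the
signed/unsigned mismatch would simply go unreported.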
>
> If you're designing an interface that takes UTF-8 strings, it still
> may be worth it to have the parameters be of a utf8-specific type, if
> you want to force your users to think about the encoding of the
> argument each time they call one of your functions... this is a
> legitimate design decision. If you're in control of the whole program,
> though, it's usually not worth it - you just keep everything in UTF-8.
That's exactly why you would do it. It gets the compiler involved, and
it will give you diagnostics that make it harder to do the wrong
thing. If the converting constructors for the utf-8 specific type are
all explicit, so that you can't accidentally silence the diagnostic and
_still_ have incorrect code, all the better. Better to be correct by
design when you can.
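
A rough sketch of such a type, under stated assumptions: utf8_string
and count_code_points are hypothetical names invented for this
illustration, not an existing Boost or standard interface, and the
constructor does not validate its input; it only records that the
caller claims the bytes are UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper: bytes that the caller asserts are UTF-8.
class utf8_string {
public:
    // explicit: a plain std::string never converts to utf8_string silently.
    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

// The compiler itself confirms there is no implicit conversion.
static_assert(!std::is_convertible<std::string, utf8_string>::value,
              "std::string must be wrapped explicitly");

// Hypothetical API that accepts only text tagged as UTF-8.
// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx);
// this assumes the input is well-formed UTF-8.
std::size_t count_code_points(const utf8_string& text) {
    std::size_t count = 0;
    for (unsigned char c : text.bytes()) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

With this in place, count_code_points(some_std_string) fails to
compile, while count_code_points(utf8_string(some_std_string))
compiles and makes the encoding decision visible at the call site.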

Patrick

