Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-18 13:36:04


----- Original Message ----
> From: Peter Dimov <pdimov_at_[hidden]>
>
> Dave Abrahams wrote:
> > At Tue, 18 Jan 2011 13:27:29 +0200,
> > Peter Dimov wrote:
> > >
> > > Dave Abrahams wrote:
> > >
> > > > I think the reason to use separate types is to provide a type-safety
> > > > barrier between your functions that operate on utf-8 and system or
> > > > 3rd-party interfaces that don't or may not. In principle, that should
> > > > force you to think about encoding and decoding at all the places where
> > > > it may be needed, and should allow you to code naturally and with
> > > > confidence where everybody is operating in utf8-land.
> > >
> > > Yes, in principle. It isn't terribly necessary if everybody is
> > > operating in UTF-8 land though.
> >
> > But they won't be. That's not today's reality.
>
> They should be, though. As a practical matter, the difference
> between taking/returning a string and taking/returning an
> utf8_t is to force people to write an explicit conversion.
> This penalizes people who are already in UTF-8 land because
> it forces them to use utf8_t( s, encoding_utf8 ) and
> s.c_str( encoding_utf8 ) everywhere, without any gain or
> need. It's true that for people whose strings are not UTF-8,
> forcing those explicit conversions may be considered a good
> thing. So it depends on what your goals are. Do you want to
> promote the use of UTF-8 for all strings, or do you want to
> enable people to remain in non-UTF-8-land?

+1

>
> There's also the additional consideration of utf8_t's invariant. Does it
> require valid UTF-8? One possible specification of fopen might be:
>
> FILE* fopen( char const* name, char const* mode );
>
> The 'name' argument must be UTF-8 on Unicode-aware platforms and
> file systems such as Windows/NTFS and Mac OS X/HFS+. It can be an
> arbitrary byte sequence on encoding-agnostic platforms and file
> systems such as Linux and Solaris, but UTF-8 is recommended.
>

+1 as well.

I would also like to add a small note on a general design principle
of C++ as a language: you don't pay for what you don't use.

And 95% of all uses of strings are encoding-agnostic!
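As one illustration of what encoding-agnostic means in practice, here is a small sketch (the split helper is made up for this example). It splits a string on an ASCII delimiter by looking only at bytes. This is safe for UTF-8 because bytes below 0x80 never occur inside a multi-byte sequence, so no decoding or special string type is involved.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split 's' on an ASCII delimiter, byte by byte.
// It never inspects or decodes the bytes between delimiters.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = s.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));  // last piece, possibly empty
            break;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

Splitting "caf\xC3\xA9/na\xC3\xAFve/x" on '/' yields three pieces with their UTF-8 bytes intact. Copying, concatenating, comparing for equality and using the string as a map key work on plain bytes in the same way.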

Artyom

      

