Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Peter Dimov (pdimov_at_[hidden])
Date: 2011-01-18 12:46:41
Dave Abrahams wrote:
> At Tue, 18 Jan 2011 13:27:29 +0200,
> Peter Dimov wrote:
> > Dave Abrahams wrote:
> > > I think the reason to use separate types is to provide a type-safety
> > > barrier between your functions that operate on utf-8 and system or
> > > 3rd-party interfaces that don't or may not. In principle, that should
> > > force you to think about encoding and decoding at all the places where
> > > it may be needed, and should allow you to code naturally and with
> > > confidence where everybody is operating in utf8-land.
> > Yes, in principle. It isn't terribly necessary if everybody is
> > operating in UTF-8 land though.
> But they won't be. That's not today's reality.
They should be, though. As a practical matter, the difference between
taking/returning a string and taking/returning a utf8_t is that the latter
forces people to write an explicit conversion. This penalizes people who are
already in
UTF-8 land because it forces them to use utf8_t( s, encoding_utf8 ) and
s.c_str( encoding_utf8 ) everywhere, without any gain or need. It's true
that for people whose strings are not UTF-8, forcing those explicit
conversions may be considered a good thing. So it depends on what your goals
are. Do you want to promote the use of UTF-8 for all strings, or do you want
to enable people to remain in non-UTF-8-land?
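
To make the cost concrete, here is a minimal sketch of what such a type might look like. Only the spellings utf8_t( s, encoding_utf8 ) and s.c_str( encoding_utf8 ) come from the discussion; the tag type, the class layout, and the greet functions are assumptions made up for illustration, not an actual proposal.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical tag type: the name encoding_utf8 appears in the thread,
// but this definition is an assumption.
struct encoding_utf8_t {};
constexpr encoding_utf8_t encoding_utf8{};

// Hypothetical wrapper: a std::string never converts to it silently.
class utf8_t {
public:
    // The caller must state that the bytes are UTF-8.
    utf8_t(std::string s, encoding_utf8_t) : bytes_(std::move(s)) {}

    // The caller must state which encoding it wants back.
    char const* c_str(encoding_utf8_t) const { return bytes_.c_str(); }

private:
    std::string bytes_;
};

// A made-up library function with a typed interface.
utf8_t greet(utf8_t const& name) {
    return utf8_t(std::string("hello, ") + name.c_str(encoding_utf8),
                  encoding_utf8);
}

// The same function taking plain strings that are assumed to be UTF-8.
std::string greet_plain(std::string const& name) {
    return "hello, " + name;
}
```

A caller whose strings are already UTF-8 writes greet( utf8_t( s, encoding_utf8 ) ).c_str( encoding_utf8 ) where greet_plain( s ) would have done the same work; a caller whose strings are in some other encoding is stopped at exactly that point, which is the trade-off described above.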
> > It's a bit like defining a separate integer type for nonnegative
> > ints for type safety reasons - useful in theory, but nobody does it.
> I refer you to Boost.Units
I'm sure that there are many libraries that use units in their interfaces; I
just haven't heard of them. :-)
There's also the additional consideration of utf8_t's invariant. Does it
require valid UTF-8? One possible specification of fopen might be:

    FILE* fopen( char const* name, char const* mode );

The 'name' argument must be UTF-8 on Unicode-aware platforms and file
systems such as Windows/NTFS and Mac OS X/HFS+. It can be an arbitrary byte
sequence on encoding-agnostic platforms and file systems such as Linux and
Solaris, but UTF-8 is recommended.
On Windows, the UTF-8 sequence may be invalid due to the presence of UTF-16
surrogates encoded as single code points, but such use is discouraged.
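
To illustrate what the invariant question amounts to, here is a rough sketch of a well-formedness check with a switch for the Windows case mentioned above. The function name is_utf8 and the allow_surrogates parameter are hypothetical; nothing here is part of any proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8. If allow_surrogates is true,
// three-byte sequences that encode U+D800..U+DFFF are accepted as well.
bool is_utf8(std::string const& s, bool allow_surrogates = false) {
    std::size_t i = 0;
    std::size_t const n = s.size();
    while (i < n) {
        unsigned char const b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                     { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte, C0/C1, or F5..FF

        if (n - i < len) return false;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            unsigned char const b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (len == 3 && cp < 0x800) return false;  // overlong encoding
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF && !allow_surrogates) return false;

        i += len;
    }
    return true;
}
```

If utf8_t required is_utf8( s ), a Windows file name containing an unpaired surrogate could not be represented at all; if it required only is_utf8( s, true ), or nothing, the type would promise correspondingly less.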