Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2012-01-31 04:52:30


On Tue, Jan 31, 2012 at 10:57, Daryle Walker <darylew_at_[hidden]> wrote:

>
> ----------------------------------------
> > Date: Mon, 30 Jan 2012 00:24:30 -0800
> > From: Artyom
> >
> > ----- Original Message -----
> > > From: Beman Dawes <bdawes_at_[hidden]>
> > >
> > >> What probably should be done is that compilers should be compelled to
> > >> support UTF-8 as the source character set in a unified way.
> > >
> > > Makes sense to me.
> > >
> > > Why don't you write up an issue for the C and C++ committees? My
> > >
> > > [snip]
> > >
> > > Another possibility is to start lobbying compiler vendors, or at least
> > > Microsoft, to support UTF-8 both with and without BOM.
> > >
> >
> > It is not only a BOM versus no-BOM issue. It is mostly about the
> > ability to define the execution character set, i.e. the character
> > set for normal "some text" literals, as well as the input character
> > set. Even more important, C++ compilers must support UTF-8 for
> > both of them.
>
> This probably isn't the right post to respond to, but I don't want to
> spend forever figuring it out.
>
> Not every system is an 8/16/32(/64)-bit computer using
> ASCII/Latin-1/UTF-8. C++ (from C) was designed so a user with a
> 9/36/81-bit EBCDIC system and one with an 8/16/32/64-bit UTF-16 system can write
> programs for the other (with the appropriate cross-compiler). We don't
> want to obnoxiously be prejudiced against systems not matching the current
> configuration trends.
>
> (I was originally going to write "9/36/72", but then realized that higher
> types only have to be a multiple of char, not each other, so my new system
> breaks more common-programmer assumptions. BTW, that's 9-bit bytes (char),
> 36-bit words (short and int), and 81-bit long-words (long and long-long).
> I wonder if anyone here can fabricate this custom hardware, to mess people
> up.)
>
> Daryle W.
>
>
Thanks, Daryle. I'm aware of this issue and thus refrained from talking
about UTF-8 only. The wording I'm interested in is "execution character set
is capable of storing any Unicode data". This would mean UTF-8 on systems
that have CHAR_BIT==8 and are compatible with ASCII, UTF-EBCDIC on IBM
mainframes, and perhaps UTF-32 on a DSP with CHAR_BIT==32 and sizeof(char)
== sizeof(long). Yet another option is to restrict the requirement to
hosted implementations only.
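
To make the mapping above concrete, here is a rough sketch, not proposed
wording. The names suggested_narrow_encoding and encode_utf8 are made up
for illustration. The first function just restates the three cases listed
above. The second shows what "capable of storing any Unicode data" looks
like in the common case, and it assumes CHAR_BIT==8.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Hypothetical helper: names the narrow encoding suggested above for a
// given platform. Only the three cases mentioned in the post are covered.
inline const char* suggested_narrow_encoding(int char_bit, bool ascii_compatible) {
    if (char_bit == 8 && ascii_compatible) return "UTF-8";
    if (char_bit == 8) return "UTF-EBCDIC";   // e.g. IBM mainframes
    if (char_bit == 32) return "UTF-32";      // e.g. a DSP with 32-bit char
    return "unspecified";                     // not discussed in the post
}

// True when the basic execution character set looks like ASCII.
// This is a spot check on a few characters, not a full proof.
inline bool looks_ascii_compatible() {
    return 'A' == 0x41 && 'a' == 0x61 && '0' == 0x30;
}

// Hypothetical helper: encodes one Unicode scalar value as UTF-8.
// Assumption: CHAR_BIT == 8, so each UTF-8 code unit fits in one char.
inline std::string encode_utf8(char32_t cp) {
    static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit char");
    // Reject values outside Unicode and the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                          // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                  // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                  // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

On a typical desktop, suggested_narrow_encoding(CHAR_BIT,
looks_ascii_compatible()) should yield "UTF-8", and any code point up to
U+10FFFF fits into a plain std::string of one to four chars.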

-- 
Yakov

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk