Subject: Re: [boost] [general] What will string handling in C++ look like in the future
From: Marsh Ray (marsh_at_[hidden])
Date: 2011-02-08 18:55:51


On 02/08/2011 04:17 PM, Chad Nelson wrote:
>
> A good rule of thumb, but keep in mind that ASCII (or more formally
> "US-ASCII") is the colloquial name for the seven-bit ISO 646 encoding,

Everybody knows that. But which one? Not everyone agrees, and there is
significant variation.

You can't point to a single standard that even a majority of people
agree on as being the official ASCII. It was revised many times over the
years.

So it's a bad way to refer to a spec. It's probably why GCC prints
compiler warning messages using the backtick/grave and apostrophe as if
they were paired single quotes. It's broken.

> and "ANSI" was used for Windows code-page 1252 because Microsoft based
> it on an early ISO-8859-1 draft.[1] (The name is still in use in the
> Windows API, but they say it's a "historical reference, but is nowadays
> a misnomer that continues to persist in the Windows community.")

MS also used it to contrast with the "OEM" code page, which was their
way of saying "ANSI" was for system stuff that didn't change (e.g. DLL
names) and "OEM" was for UI and interoperable stuff that was deeply
customized for foreign markets.

> The blame for "Unicode encoding" can probably be laid on Microsoft
> too.[2]

Unicode was originally sold as a 16-bit fixed-width encoding, with
perhaps just the minor variation for endianness. 64K characters ought to
be enough for anybody, they said. But they just couldn't stop themselves
from inflicting yet another endless variety of multibyte encodings on
the world.
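
To make that concrete, here is a minimal sketch of why 16 bits stopped
being fixed-width: in UTF-16, any code point above U+FFFF takes two
16-bit units (a surrogate pair). The name encode_utf16 is made up for
illustration and is not from any library, and the function assumes its
input is already a valid scalar value.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (<= 0x10FFFF and not a surrogate);
// no validation is done here.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in one 16-bit unit: the original "fixed-width" case.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Needs a surrogate pair: split the 20 remaining bits in half.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

So encode_utf16(U'A') yields one unit, while a code point such as
U+1F600 yields two, and code that indexes by 16-bit unit no longer
indexes by character.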

> (Sorry to get pedantic on you, just taking a break before the

No, surely it was I who was trolling for pedantry. Sorry!

> hopefully-final coding session on my UTF string library, which includes
> converter classes for many common code-pages, including ascii (typedef
> of us_ascii) and windows_ansi (typedef of windows1252)... I've been
> swimming in this stuff for the last several weeks. ;-) )

Oh I know. I worked on that stuff for many years while working on
document printing and display software. The variations are endless. I
used to keep a book on my desk that was over an inch thick of just code
pages. Half of them were "ASCII" code pages. The other half were "EBCDIC".

Perhaps you've seen this:
http://en.wikipedia.org/wiki/ISO/IEC_646#National_variants

I still can't figure out if "ISO 646 US" and "ANSI X3.4-1968" are the
same as Unicode U+0000 - U+007F (for those 128 points). I think there
are some differences.
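
Whichever way that turns out, the question only arises for text that
stays inside the 7-bit range, so a check like the following is usually
the first step. This is a rough sketch: is_seven_bit is a name invented
here, and passing the check says nothing about which 7-bit standard the
bytes were written against.

```cpp
#include <string>

// Hypothetical helper: true if every byte is in the range 0x00-0x7F.
// It only tests the range; it cannot tell one ISO 646 variant from another.
bool is_seven_bit(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 0x7F) {
            return false; // high bit set: not any 7-bit encoding
        }
    }
    return true;
}
```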

You can maybe get away with "US ASCII" in the US (other than Spanish
speakers), Canada (other than Quebec), Australia and New Zealand. Maybe
a few other places. But make sure you reference a modern, relevant
standard for it. It'd probably be better if you just referenced the
specific standards directly and avoided the imprecise term "ASCII".

- Marsh

