From: Don G (dongryphon_at_[hidden])
Date: 2005-04-05 00:45:37
I thought I would jump in with some small observations:
--- Erik Wien <wien_at_[hidden]> wrote:
> Most users should not really care what kind of encoding
> and normalization form is used. They want to work with
> the string, not fiddle with its internal representation.
You do care about the representation when communicating with system
APIs or writing data to networks or files. For example, if UTF-32
were the chosen representation, some programmers would be constantly
converting to UTF-16 to call the system, and vice versa if UTF-16 is
chosen where the system wants something else.
> I would be surprised if any other encoding than UTF-16 would
> end up as the most efficient one. UTF-8 suffers from the big
> variation in code unit count for any given code point and
> UTF-32 is just a waste of space for little performance for
> most users. You never know though.
Here again, the performance measure could easily be dominated by
conversions to the underlying system's encoding, depending on the
platform.
Also, on some systems, particularly the Mac, the system not only has
an encoding preference, it doesn't particularly like "wchar_t *"
either. On the Mac, most text is a CFString (a handle of sorts to the
text). On Windows, you encounter BSTRs as well.
In my own (admittedly less thought-out) work on this, I decided to
have the default encoding be platform-specific, to eliminate the
enormous number of conversions that might otherwise be needed. For
example, on the Mac, I had an allocator-like strategy that allowed
all unicode_strings to be backed by a CFString. There was a
get_native() method that returned a platform-specific value
(documented on a per-platform basis) to allow platform-specific code
to work more efficiently.
Just some thoughts...
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk