Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-19 23:15:55


On 01/19/2011 07:34 AM, Alexander Lamaison wrote:
> On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
>
>>> I do not believe that UTF-8 is the way to go. In fact I know it is not,
>>> except perhaps for the very near future for some programmers ( Linux
>>> advocates ).
>> :-) Just for the record, I'm not a Linux advocate any more than I'm
>> a Windows advocate. I use both .. I'm writing this on a windows machine.
>> What I would like is the whole encoding madness/dysfunction (including
>> but not limited to the dual TCHAR/whateverchar-based interfaces) to stop.
>> Everywhere.
> Even if I bought the UTF-8ed-Boost idea, what would we do about the STL
> implementation on Windows which expects local-codepage narrow strings? Are
> we hoping MS etc. change these to match? Because otherwise we'll be
> converting between narrow encodings for the rest of eternity.
That's the reality already. As long as people use local narrow
encodings we will be converting between them. If your code runs on
Windows in Korea and in Spain, you'll get local-codepage narrow strings
that are incompatible with each other. At least if there were a
utf8_string type, a utf16_string type, and a utf32_string type, with
documentation about how to implement templated conversions to them
(code conversion facets), someone could write a library that uses them,
and everyone using all of these different local encodings would know
what to do to use the library. The way it is today, it's much more
difficult to figure out how to write a generic library that accepts
text from a user. What does a char* or a std::basic_string<char> imply
about encoding? Who knows what you'll get. A local 8-bit code page?
Shift-JIS? UTF-8? EUC? This is just saying that, hey, here's one way to
deal with this issue.
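
To make the idea concrete, here is a rough sketch of what encoding-tagged string types might look like. None of these names exist in Boost or the standard library; encoded_string, the tag structs, and count_code_points are all made up for illustration, and the sketch assumes a C++11 compiler for char16_t and char32_t. The only point is that the function signature now says which encoding it expects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical tag types: one per encoding the library understands.
struct utf8_tag {};
struct utf16_tag {};
struct utf32_tag {};

// Hypothetical wrapper: the same code-unit storage as std::basic_string,
// but the encoding is part of the type. No validation is done here.
template <class Tag, class CharT>
class encoded_string {
public:
    explicit encoded_string(const std::basic_string<CharT>& units)
        : units_(units) {}
    const std::basic_string<CharT>& units() const { return units_; }
private:
    std::basic_string<CharT> units_;
};

typedef encoded_string<utf8_tag, char> utf8_string;
typedef encoded_string<utf16_tag, char16_t> utf16_string;
typedef encoded_string<utf32_tag, char32_t> utf32_string;

// A library function that states its expectation in the signature.
// Assumes the text is well-formed UTF-8 and counts every byte that is
// not a continuation byte (10xxxxxx), i.e. one per code point.
std::size_t count_code_points(const utf8_string& text) {
    std::size_t count = 0;
    const std::string& bytes = text.units();
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

Because the constructor is explicit, a caller holding a plain std::string in some local code page has to decide consciously to wrap it (after converting it) rather than passing it through by accident.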

This sort of scheme lets the Windows STL implementation exist as it is,
but tells callers: here's what you need to do so that I know how to
treat the text you pass to me as an argument. If it's in a local code
page, you need to convert it to what I want. With validating string
types that support the three Unicode encodings (UTF-8, UTF-16, and
UTF-32), you can trust that the data is validly encoded, although all
the normal issues about whether the content is meaningful to you still
exist.
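
As a sketch of what "validating" could mean for the UTF-8 case, something along these lines would do. The names is_valid_utf8 and validated_utf8 are hypothetical, not an existing API, and a real library would probably want finer-grained error reporting than a single exception.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns true if `s` is well-formed UTF-8: correct lead/continuation
// structure, no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }            // ASCII, one byte
        std::size_t len;
        unsigned long cp, min;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;           // stray continuation or invalid lead byte
        if (n - i < len) return false;              // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                 // overlong encoding
        if (cp > 0x10FFFF) return false;            // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}

// Hypothetical validating type: construction fails unless the bytes are
// well-formed, so any function receiving one can rely on that.
class validated_utf8 {
public:
    explicit validated_utf8(const std::string& bytes) : bytes_(bytes) {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("not well-formed UTF-8");
    }
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};
```

Text in a Latin-1 code page containing an accented letter would be rejected at construction, which is exactly the point where the caller should have converted it.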

If you use the standard code conversion facets, as specified for C++
locales, to convert from local code pages to your strings, then you can
leverage existing work. Why reinvent the wheel?
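
For reference, this is roughly how a standard code conversion facet is driven by hand. The helper name widen is made up for this sketch; std::codecvt, std::use_facet, and the in() member are the standard pieces. Which narrow encoding gets decoded depends entirely on the locale passed in, and which locale names are available is platform-specific.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a narrow string in the encoding of `loc`
// to a wide string, using the codecvt facet that the locale carries.
std::wstring widen(const std::string& narrow, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    // Each wide character consumes at least one input byte, so the
    // input length is an upper bound on the output length.
    std::wstring out(narrow.size(), L'\0');
    if (narrow.empty()) return out;

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;
    const std::codecvt_base::result r = cvt.in(
        state,
        narrow.data(), narrow.data() + narrow.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    // Anything other than a complete, successful conversion is treated
    // as an error in this sketch (including a truncated final sequence).
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("narrow-to-wide conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Passing std::locale("") instead of the classic locale would pick up the user's environment, which is where the local code page comes in. A facet targeting a UTF-8 or UTF-32 string type would plug into the same interface.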

Patrick


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk