From: James Porter (porterj_at_[hidden])
Date: 2007-09-27 12:56:06


On 9/27/07, Sebastian Redl <sebastian.redl_at_[hidden]> wrote:
>
> Just nit-picking here: it converts to wchar_t, which may or may not be
> UTF-16. On Win32 platforms, it is, but on Linux, for example, it's UTF-32.

Yeah, I realized that after I clicked "send". I guess I should eat breakfast
before sending email. :)

> True. I think the strings should be immutable. I think experience with
> Java and C# compared to C++ shows that an immutable string class is
> superior in most use cases.

There should be some means to (possibly indirectly) modify a
variable-width-encoded string, though it doesn't necessarily have to be
through the class itself. A stringstream may be more appropriate.

> > That said, I think a good (general) roadmap for this project would be:
> > 1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy,
> > though string constants may pose a problem)
> >
> Doesn't basic_string<wchar_t> do just that already?

It doesn't do it in a portable manner. On Windows, basic_string<wchar_t> is,
ostensibly, UTF-16, but on Linux, it's UTF-32. There should be a portable
solution that guarantees a particular fixed-width encoding. I'd argue that
basic_string<wchar_t> isn't really Unicode at all, though I'm being
nit-picky. char_traits<wchar_t>::state_type is mbstate_t, which is the state
type used by codecvt to convert a narrow (ASCII, or whatever multibyte
encoding the locale uses) stream into a wide stream. In short, nothing about
the stream (or ultimately the string) says it's Unicode; in practice it's
often just ASCII stored with 2 (or 4) bytes per character. This goes back to
the problems with using wfstream.
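
To make the portability point concrete, here is a minimal sketch, assuming a
C++11 compiler (newer than what was current when this was written). The
helper names are made up for illustration. It checks the state_type claim at
compile time and counts how many wchar_t units one character outside the
Basic Multilingual Plane takes on the current platform:

```cpp
#include <cassert>
#include <climits>      // CHAR_BIT
#include <cstddef>      // std::size_t
#include <cwchar>       // std::mbstate_t
#include <string>       // std::char_traits, std::wstring
#include <type_traits>  // std::is_same

// The standard specifies this typedef, so it should hold on any
// conforming implementation, whatever the width of wchar_t.
static_assert(
    std::is_same<std::char_traits<wchar_t>::state_type, std::mbstate_t>::value,
    "char_traits<wchar_t>::state_type should be mbstate_t");

// Hypothetical helper: width of one wchar_t code unit, in bits.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Hypothetical helper: number of wchar_t units the compiler emits for
// U+1D11E (MUSICAL SYMBOL G CLEF), a character outside the BMP.
std::size_t units_for_non_bmp() {
    return std::wstring(L"\U0001D11E").size();
}

// Hypothetical self-check: with a 32-bit wchar_t the character fits in one
// unit; with a 16-bit wchar_t compilers in practice emit a surrogate pair.
void check_wchar_width() {
    assert(units_for_non_bmp() == (wchar_bits() >= 32 ? 1u : 2u));
}
```

The same source line therefore produces strings of different lengths
depending on the platform, which is the non-portability being described.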

I think, to have a truly distinct basic_string specialization, we'd need
portable 16- and 32-bit char types, and a way to unambiguously specify its
encoding. My hope is that we can use char_traits<...>::state_type as a way
to make code conversion simpler. Ideally, I'd like something that examines
the state_type of the source and the target, and builds a converter based on
those two pieces of information. It would be great if I could say something
like:

  ofstream<utf8> file("out.txt");
  file << ucs4string << utf16string << jisstring << asciistring << endl;

and have it work automatically.
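
As a rough illustration of one conversion such a stream would have to do
behind the scenes, here is a minimal sketch of a UCS-4 to UTF-8 encoder. It
assumes a C++11 compiler for char32_t and std::u32string. The names to_utf8
and utf8_demo are hypothetical, not part of Boost or the standard library,
and this is not the converter-building machinery proposed above:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode a sequence of UTF-32 code points as UTF-8.
// Throws std::invalid_argument for surrogates and values above U+10FFFF.
std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx + 2 trailing
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes: 11110xxx + 3 trailing
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical usage: ASCII passes through, the euro sign takes 3 bytes.
void utf8_demo() {
    assert(to_utf8(U"A") == "A");
    assert(to_utf8(U"\u20AC") == "\xE2\x82\xAC");
}
```

A real solution would pick the converter from the source and target
encodings rather than hard-coding one pair as this sketch does.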

- James

