Subject: [boost] [string] proposal
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-21 06:25:07


Dear list,

following the whole string encoding discussion I would like
to make some suggestions.

From the whole debate it is becoming clear that an
instant switch from the encoding-agnostic/platform-native
std::string to a UTF-8-encoded std::string is not likely
to happen.

Then it was proposed that we create a utf8_t string type
that would be used *together* (for all eternity) with
the standard basic_string<>. While I see the advantages
here, I (as I already said elsewhere) have the following
problem with this approach:

Using a name like utf8_t, u8string, string_utf8, etc.
suggests, at least to me (and I have consulted several
people about this off the list), that UTF-8 is still
something special, and IMO it also sends the message
that it is OK to remain forever with the various encodings
and std::string as it is today. We should *IMO* endorse
the opposite.

My suggestion is the following:

Let us create a class called boost::string that will have
all the properties that a string handling class in 2011+ A.D.
should have, basically what std::string should have been.

Then there are two alternatives:

a) When all the zillions of lines of legacy code in FORTRAN,
COBOL, BASIC, LOGO, etc. :) are fixed / ported / abandoned,
and UTF-8 becomes a true standard for text encoding
widely accepted by the whole IT industry and markets,
and all the issues that prevent us from doing the transition
now are resolved, this string becomes the standard, like
many other things from Boost in the past, and replaces
the current std::string.

b) As some (having much more insight into how the
standardization committee works than I do) have pointed out,
it will never become a true standard. But with Boost's
influence it at least becomes a de facto standard for
strings and is (hopefully) adopted by the libraries
that currently feel the need to invent string classes
themselves (with good reason).

Also, I've uploaded to the vault the file string_proposal.zip
containing my (naive and inexpert) idea of what the
interface for boost::string and the related classes could
look like (it still needs some work and it is completely
unoptimized, unbeautified, etc.).

/me ducks and covers :)

The idea is to let std::string/wstring remain
platform-specifically encoded, as it is now, but also let
boost::string handle the conversions as transparently as
possible, so that if the standard adopts it, std::string
would become a synonym for boost::string.
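
A minimal sketch of what "always UTF-8 inside, convert at the
boundaries" could look like. This is not the interface from
string_proposal.zip: the names sketch::string, append, utf8 and
to_utf32 are hypothetical, and UTF-32 (std::u32string, C++11)
stands in for the platform-native encoding, since converting
from a real native code page would need locale-specific tables.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// Hypothetical stand-in for the proposed boost::string:
// the internal representation is always UTF-8.
class string {
public:
    string() = default;

    // Converting constructor: encode UTF-32 code points as UTF-8.
    string(const std::u32string& code_points) {
        for (char32_t cp : code_points) append(cp);
    }

    // Append one code point, encoded as 1 to 4 UTF-8 bytes.
    void append(char32_t cp) {
        // Only Unicode scalar values are accepted (no surrogates).
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            bytes_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            bytes_ += static_cast<char>(0xC0 | (cp >> 6));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            bytes_ += static_cast<char>(0xE0 | (cp >> 12));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            bytes_ += static_cast<char>(0xF0 | (cp >> 18));
            bytes_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Decode back to UTF-32. Assumes bytes_ is well-formed UTF-8,
    // which holds here because only append() ever writes to it.
    std::u32string to_utf32() const {
        std::u32string out;
        std::size_t i = 0;
        while (i < bytes_.size()) {
            const unsigned char lead = static_cast<unsigned char>(bytes_[i]);
            const std::size_t len =
                lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            // Strip the length prefix bits from the lead byte.
            char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
            for (std::size_t k = 1; k < len; ++k) {
                cp = (cp << 6) |
                     (static_cast<unsigned char>(bytes_[i + k]) & 0x3F);
            }
            out += cp;
            i += len;
        }
        return out;
    }

    // Read-only access to the UTF-8 bytes.
    const std::string& utf8() const { return bytes_; }

private:
    std::string bytes_;
};

} // namespace sketch
```

Because the constructor is not explicit, a function taking a
sketch::string can be called with a std::u32string directly,
which is one way the "as transparently as possible" part could
be read. Whether implicit conversions are actually desirable
here is a design question this sketch does not settle.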

It is only partially implemented and there are two examples
showing how things could work, but the real UTF-8 validation,
transcoding and error handling are of course missing. Remember
that it is aimed at the design of the interfaces at this point.
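
For reference, the validation part could be sketched roughly
as below. This is an assumption about what such a check might
look like, not code from the uploaded proposal; the function
name is_valid_utf8 is hypothetical. It rejects stray or missing
continuation bytes, overlong forms, surrogates and values
above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;    // total bytes in this sequence
        unsigned long cp;   // code point being assembled
        unsigned long min;  // smallest value allowed for this length
        if (lead < 0x80) {
            len = 1; cp = lead; min = 0;
        } else if (lead >= 0xC0 && lead < 0xE0) {
            len = 2; cp = lead & 0x1F; min = 0x80;
        } else if (lead >= 0xE0 && lead < 0xF0) {
            len = 3; cp = lead & 0x0F; min = 0x800;
        } else if (lead >= 0xF0 && lead < 0xF8) {
            len = 4; cp = lead & 0x07; min = 0x10000;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (n - i < len) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

What to do on failure (throw, substitute U+FFFD, return an
error code) is exactly the error-handling question that is
still open.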

If you have the time, have a look, and if my suggestions
and/or the code look completely wrong, please feel free
to slash them to pieces :) and, if you feel up to it, propose
something better.

If this, or something completely different and much better
that comes out of it, is agreed upon, we could set up
a dedicated git repository for Boost.String and maybe see
whether the newly suggested collaborative development in
per-boost-component repositories really works. :)
If some of the people who are skilled with Unicode
would join or lead the effort, it would be awesome.

Best,

Matus

