Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: bernardH (boost-dev.ml_at_[hidden])
Date: 2011-01-20 03:41:17


Dave Abrahams <dave <at> boostpro.com> writes:

>
> At Wed, 19 Jan 2011 23:25:34 +0000,
> Brent Spillner wrote:
> >
> > On 1/19/2011 11:33 AM, Peter Dimov wrote:
> > > This was the prevailing thinking once. First this number of bits was 16,
> > > which incorrect assumption claimed Microsoft and Java as victims, then
> > > it became 21 (or 22?). Eventually, people realized that this will never
> > > happen even if we allocate 32 bits per character, so here we are.
> >
> > The OED lists ~600,000 words, so 32 bits is enough space to provide a
> > fully pictographic alphabet for over 7,000 languages as rich as English,
> > with room for a few line-drawing characters left over. Surely that's enough?
>
> Even if it's theoretically possible, the best standards organization
> the world has come up with for addressing these issues was unable to
> produce a standard that did it.

I must confess a lack of knowledge about encodings, but my understanding
is that a string can be viewed as a sequence of raw data (without semantics),
of code points, or of glyphs.

The current/upcoming std::string, std::u16string and std::u32string
would be the raw data containers, with char*, char16_t* and
char32_t* as random access iterators.

I believe that wrt encoding, one size does not fit all because
of the domain/architecture specific tradeoffs between memory
consumption and random access speed.
(However, maybe two sizes fit all, namely utf-8 for compact
representation and utf-32 for random access).

So my uninformed wish would be for something along these lines
(disregarding constness issues for the moment):
namespace std {
namespace unicode {

template<typename CharT> struct code_points {
    typedef /* implementation-defined */ iterator;

    explicit code_points(std::basic_string<CharT>& s_) : s(s_) {}

    iterator begin();
    iterator end();
    // ...
    std::basic_string<CharT>& s;
};

// convenience functions
template<typename CharT>
code_points<CharT> as_code_points(std::basic_string<CharT>& s)
{ return code_points<CharT>(s); }

}}
code_points<> would be specialized to
provide a random access code_points<char32_t>::iterator,
while code_points<char>::iterator would be a forward iterator.
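
To make the forward-iterator case concrete, here is a rough sketch of what
code_points<char>::iterator might look like. All names below
(utf8_code_point_iterator, code_points_begin, code_points_end) are made up
for illustration and are not part of any existing or proposed library. The
sketch assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical sketch, not a proposed std API: a read-only iterator that
// decodes code points on the fly from UTF-8 bytes. Assumes well-formed input.
class utf8_code_point_iterator {
public:
    // Tagged "forward" because it is multi-pass; strictly speaking, returning
    // by value from operator* does not meet the C++11 forward requirements.
    typedef std::forward_iterator_tag iterator_category;
    typedef char32_t                  value_type;
    typedef std::ptrdiff_t            difference_type;
    typedef const char32_t*           pointer;
    typedef char32_t                  reference;

    explicit utf8_code_point_iterator(const char* p = nullptr) : p_(p) {}

    char32_t operator*() const {
        assert(p_ != nullptr);
        const unsigned char lead = static_cast<unsigned char>(*p_);
        const std::size_t n = sequence_length(lead);
        // Keep only the payload bits of the lead byte (7, 5, 4 or 3 bits).
        char32_t cp = (n == 1) ? lead : (lead & (0xFFu >> (n + 1)));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3Fu);
        return cp;
    }

    // Advancing means skipping the whole variable-length sequence.
    utf8_code_point_iterator& operator++() {
        p_ += sequence_length(static_cast<unsigned char>(*p_));
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        utf8_code_point_iterator tmp(*this);
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) { return a.p_ == b.p_; }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) { return a.p_ != b.p_; }

private:
    // Number of bytes in the sequence, deduced from the lead byte alone.
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        return 4;
    }
    const char* p_;
};

// Hypothetical helpers standing in for code_points<char>::begin()/end().
inline utf8_code_point_iterator code_points_begin(const std::string& s) {
    return utf8_code_point_iterator(s.data());
}
inline utf8_code_point_iterator code_points_end(const std::string& s) {
    return utf8_code_point_iterator(s.data() + s.size());
}
```

With char32_t storage none of this decoding is needed, since a plain
const char32_t* already visits one code point per element.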

Algorithms processing sequences of code points could
be specialized to take advantage of random access when available.
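
One way such a specialization could be done is ordinary tag dispatch on the
iterator category, the same technique std::advance and std::distance rely on.
The following is only a sketch under that assumption; the namespace sketch and
the names nth_code_point and nth_impl are hypothetical, and the caller is
assumed to guarantee that the n-th code point exists.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <string>

namespace sketch {

// O(1): random access iterators can jump straight to the n-th code point.
template <typename It>
char32_t nth_impl(It first, std::size_t n, std::random_access_iterator_tag) {
    typedef typename std::iterator_traits<It>::difference_type diff_t;
    return first[static_cast<diff_t>(n)];
}

// O(n): forward iterators must step over every preceding code point.
template <typename It>
char32_t nth_impl(It first, std::size_t n, std::forward_iterator_tag) {
    while (n-- > 0) ++first;
    return *first;
}

// Entry point: picks the best overload from the iterator's category.
// Precondition (unchecked): the sequence holds more than n code points.
template <typename It>
char32_t nth_code_point(It first, std::size_t n) {
    typedef typename std::iterator_traits<It>::iterator_category category;
    return nth_impl(first, n, category());
}

// Small usage example: the same call works on both kinds of sequence.
inline void usage_example() {
    const std::u32string fast = U"abc";                      // random access
    const std::forward_list<char32_t> slow = {U'x', U'y'};   // forward only
    assert(nth_code_point(fast.begin(), 2) == U'c');
    assert(nth_code_point(slow.begin(), 1) == U'y');
}

} // namespace sketch
```

The caller writes the same code in both cases; only the complexity changes
with the underlying storage.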

A template<typename CharT> struct glyphs {}; would also be provided,
but it could not offer random access (utf-64, anyone? :) )

Note that the usual idiom of
for( ; b != e; ++b)
{ process(*b); }
would not be as efficient as possible for variable-length
encodings of code points (e.g. utf-8), because process
certainly performs the same operations as ++b to retrieve the
whole code point, so we should prefer
while( b != e)
{ b= process(b);}
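
As a sketch of that second idiom over raw UTF-8 bytes: the hypothetical
decode_one below plays the role of process, decoding one code point and
returning the position just past it, so the bytes are walked only once.
Both function names are made up for illustration, and the input is assumed
to be well-formed UTF-8.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point starting at p into out and
// returns the position just past it. Assumes well-formed UTF-8.
inline const char* decode_one(const char* p, char32_t& out) {
    const unsigned char lead = static_cast<unsigned char>(*p);
    int extra;  // number of continuation bytes that follow the lead byte
    if (lead < 0x80)              { out = lead;         extra = 0; }
    else if ((lead >> 5) == 0x06) { out = lead & 0x1Fu; extra = 1; }
    else if ((lead >> 4) == 0x0E) { out = lead & 0x0Fu; extra = 2; }
    else                          { out = lead & 0x07u; extra = 3; }
    ++p;
    // Each continuation byte contributes its low 6 bits.
    for (; extra > 0; --extra, ++p)
        out = (out << 6) | (static_cast<unsigned char>(*p) & 0x3Fu);
    return p;
}

// The "b = process(b)" loop: decoding and advancing happen in one step.
inline std::vector<char32_t> decode_all(const std::string& s) {
    std::vector<char32_t> result;
    const char* b = s.data();
    const char* e = b + s.size();
    while (b != e) {
        char32_t cp = 0;
        b = decode_one(b, cp);
        result.push_back(cp);
    }
    return result;
}
```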

The problem is that I don't have the knowledge to know if
processing code points (instead of glyphs) is truly relevant
in practice. If it is, I believe that something along the lines
of my proposal would:
1°) leverage the existing std::basic_string<>,
2°) empower the end-user to select the memory consumption
/ algorithmic complexity tradeoff when processing code points.

What do others think of this?

Best Regards,

Bernard

