Subject: Re: [boost] [string] Realistic API proposal
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-01-28 07:16:07


On 28/01/2011 11:41, Artyom wrote:

> b) code_point_iterator - back inserter

You could simply define a push_back(char32_t) and have it naturally be
called by std::back_inserter.
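
A minimal sketch of that idea, under stated assumptions: the class name utf8_string, its bytes() accessor and the encode() helper are made up for illustration and are not part of any proposal in this thread. std::back_insert_iterator assigns values of Container::value_type and forwards them to push_back, so exposing char32_t as value_type is enough for std::back_inserter to feed code points in.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical UTF-8 string wrapper; the name is illustrative only.
class utf8_string {
public:
    // std::back_insert_iterator assigns Container::value_type objects,
    // so this typedef makes it accept code points rather than bytes.
    typedef char32_t value_type;

    // Append one code point, encoded as 1 to 4 UTF-8 code units.
    void push_back(char32_t cp) {
        // Only Unicode scalar values are encodable (no surrogates).
        assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
        if (cp < 0x80) {
            bytes_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            bytes_ += static_cast<char>(0xC0 | (cp >> 6));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            bytes_ += static_cast<char>(0xE0 | (cp >> 12));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            bytes_ += static_cast<char>(0xF0 | (cp >> 18));
            bytes_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Underlying UTF-8 code units.
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// Copies a range of code points into a utf8_string via std::back_inserter.
inline utf8_string encode(const char32_t* first, const char32_t* last) {
    utf8_string out;
    std::copy(first, last, std::back_inserter(out));
    return out;
}
```

No dedicated output iterator class is needed: any standard algorithm that writes through an output iterator can target the string this way.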

> 3. It allows to use std::string meanwhile under the hood as storage
> giving high efficiency when assigning boost::string to std::string
> when the implementation is COW (almost all implementations with
> exception of MSVC)

COW implementations of std::string are not allowed anymore starting with
C++0x.

> 4. It is full unicode aware
> 5. It pushes "UTF-8" idea to standard C++
> 6. You don't pay for what you do not need.

What am I paying for? I don't see how I gain anything.

>
> Proposed API:
> -------------
>
> namespace boost {
>
> // Fully bidirectional iterator
> template<typename UnitsIterator>
> class const_code_point_iterator {
> public:
>
> const_code_point_iterator(UnitsIterator begin, UnitsIterator end); // begin
> const_code_point_iterator(UnitsIterator begin, UnitsIterator end,
>                           UnitsIterator location); // current pos
> const_code_point_iterator(); // end
>
> #ifdef C++0x
> typedef char32_t const_code_point_type;
> #else
> typedef unsigned const_code_point_type;
> #endif

Just define boost::char32 once (depending on BOOST_NO_CHAR32_T) and use
that instead of putting ifdefs everywhere.
(that's what boost/cuchar.hpp does in my library)
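
A rough sketch of the single-typedef approach. Assumptions: the namespace sketch stands in for boost, the fallback type unsigned long is chosen here only because it is guaranteed to be at least 32 bits wide, and replacement_character() is a made-up helper. This is not the content of boost/cuchar.hpp. In real Boost code BOOST_NO_CHAR32_T would come from Boost.Config; this standalone version does not include that header, so the macro is only set if the build defines it.

```cpp
#include <cassert>
#include <climits>

namespace sketch {  // hypothetical namespace, standing in for boost

// One conditional, in one place. Everything else uses sketch::char32.
#ifdef BOOST_NO_CHAR32_T
typedef unsigned long char32;  // assumption: any type of at least 32 bits
#else
typedef char32_t char32;       // native C++0x character type
#endif

// Client code names the typedef and needs no #ifdef of its own.
inline char32 replacement_character() { return 0xFFFD; }

}  // namespace sketch

// Runtime check (not static_assert) so the sketch also builds pre-C++0x.
inline void example() {
    assert(sizeof(sketch::char32) * CHAR_BIT >= 32);
    assert(sketch::replacement_character() == 0xFFFD);
}
```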

> // UTF validation
>
> bool is_valid_utf() const;

See, that's what makes the whole thing pointless.
Your type doesn't add any semantic value on top of std::string, it's
just an agglomeration of free functions into a class. That's a terrible
design.
The only advantage that a specific type for unicode strings would bring
is that it could enforce certain useful invariants.

But your proposal doesn't even enforce the string is valid UTF-8.

Enforcing that the string is in a valid UTF encoding and is normalized
in a specific normalization form can make most Unicode algorithms
several orders of magnitude faster.
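
To illustrate what enforcing an invariant buys, here is a minimal sketch that covers only the UTF-8 validity half (checking a normalization form needs Unicode character data and is left out). The names valid_utf8, is_valid_utf8 and count_code_points are hypothetical. Validation happens once, in the constructor, so every algorithm that receives a valid_utf8 can drop its error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// True if s is well-formed UTF-8: no stray or missing continuation bytes,
// no overlong forms, no surrogates, nothing above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp, min;  // decoded value and smallest legal value
        if (b < 0x80)                 { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0)  { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)  { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0)  { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;            // continuation byte or invalid lead
        if (n - i < len) return false;  // sequence truncated
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Hypothetical type whose only job is to carry the "valid UTF-8" invariant.
class valid_utf8 {
public:
    explicit valid_utf8(const std::string& s) : data_(s) {
        if (!is_valid_utf8(data_))
            throw std::invalid_argument("not valid UTF-8");
    }
    const std::string& str() const { return data_; }
private:
    std::string data_;  // never modified after construction
};

// Relies on the invariant: counts lead bytes with no error paths at all.
inline std::size_t count_code_points(const valid_utf8& u) {
    const std::string& s = u.str();
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++count;
    return count;
}

inline void example() {
    valid_utf8 cafe("caf\xC3\xA9");  // 'c' 'a' 'f' U+00E9
    assert(count_code_points(cafe) == 4);
}
```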

Since people seem to want this, here is a simple proposal:

template<typename T>
struct ustring;

where T must be a Forward Sequence of char, char16, char32 or wchar_t.
The type then acts as an adaptor over that sequence but enforces that
the data is encoded in UTF-X in normalization form C, with X deduced
from the value type of the inner Forward Sequence.

ustring would be an immutable range of code units, with whatever
refinements (bidirectional or random access) the inner Forward Sequence
allows.
I thought it was accepted that strings should be immutable. If not,
insertion at the front or back could be added when the underlying
Forward Sequence allows it.
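
A very rough sketch of the shape such an adaptor could take. Everything beyond the name ustring is an assumption: the utf_bits trait, the encoding_bits constant, and the use of the width of the code unit type to deduce X (which presumes 8, 16 or 32-bit code units). The validity and normalization checks, which are the point of the type, are deliberately omitted here and marked in a comment.

```cpp
#include <cassert>
#include <climits>
#include <string>
#include <vector>

// Hypothetical trait: deduces X in UTF-X from the width of the code unit.
template<typename CodeUnit>
struct utf_bits {
    enum { value = sizeof(CodeUnit) * CHAR_BIT };
};

// Adaptor over a sequence of code units, exposing read-only access only.
template<typename T>
class ustring {
public:
    typedef typename T::value_type     value_type;
    typedef typename T::const_iterator const_iterator;
    typedef const_iterator             iterator;  // immutable: no mutable view

    enum { encoding_bits = utf_bits<value_type>::value };

    explicit ustring(const T& units) : units_(units) {
        // A real implementation would reject input here unless it is valid
        // UTF-X in normalization form C; this sketch omits that check.
    }

    // Iterators keep whatever traversal category T's iterators have.
    const_iterator begin() const { return units_.begin(); }
    const_iterator end() const { return units_.end(); }

private:
    T units_;
};

inline void example() {
    ustring<std::string> a(std::string("abc"));
    ustring<std::vector<char32_t> > b(std::vector<char32_t>(3, U'x'));
    assert(int(ustring<std::string>::encoding_bits) == 8);
    assert(int(ustring<std::vector<char32_t> >::encoding_bits) == 32);
    assert(std::string(a.begin(), a.end()) == "abc");
    assert(b.end() - b.begin() == 3);
}
```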

Its operator+ would return a lazy join expression.
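
One possible reading of "lazy join expression", as a small expression-template sketch. The types text and joined are hypothetical stand-ins, not ustring itself. operator+ copies nothing; it returns an object that refers to both operands and reads through to them on demand.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical immutable string type, standing in for ustring.
class text {
public:
    explicit text(const std::string& s) : units_(s) {}
    std::size_t size() const { return units_.size(); }
    char operator[](std::size_t i) const { return units_[i]; }
private:
    std::string units_;
};

// Lazy concatenation: holds references to both sides and copies nothing.
// Caveat: it must be consumed within the full expression that built it,
// because a nested join refers to a temporary.
template<typename L, typename R>
class joined {
public:
    joined(const L& l, const R& r) : l_(l), r_(r) {}
    std::size_t size() const { return l_.size() + r_.size(); }
    // Code unit i, read from whichever operand holds that position.
    char operator[](std::size_t i) const {
        return i < l_.size() ? l_[i] : r_[i - l_.size()];
    }
    // Materialize the join into contiguous storage (the only copy made).
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < size(); ++i) out += (*this)[i];
        return out;
    }
private:
    const L& l_;
    const R& r_;
};

inline joined<text, text> operator+(const text& a, const text& b) {
    return joined<text, text>(a, b);
}

// Allows chaining: (a + b) + c builds a nested join expression.
template<typename L, typename R>
joined<joined<L, R>, text> operator+(const joined<L, R>& a, const text& b) {
    return joined<joined<L, R>, text>(a, b);
}

inline void example() {
    text a("foo"), b("bar"), c("!");
    assert((a + b).size() == 6);
    assert((a + b)[3] == 'b');
    assert((a + b + c).str() == "foobar!");
}
```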

And that's all there is to it. Use free functions for the rest; ustring
could provide some member helpers if that really makes life easier for
some people.

All of this is trivial to implement quickly with my Unicode library.

