Subject: Re: [boost] [string] Realistic API proposal
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-01-28 08:02:00


Hi Artyom,

Artyom wrote:
> I'd like to provide a realistic string API proposal:

I've been keeping out of this up to now, but since there is something
concrete here I'll share my thoughts.

> // Fully bidirectional iterator
> template<typename UnitsIterator>
> class const_code_point_iterator {
> public:
>
> const_code_point_iterator(UnitsIterator begin,
> UnitsIterator end); // begin
> const_code_point_iterator(UnitsIterator begin,
> UnitsIterator end,
> UnitsIterator location); // current pos
> const_code_point_iterator(); // end
>
> #ifdef C++0x
> typedef char32_t const_code_point_type;
> #else
> typedef unsigned const_code_point_type;
> #endif
>
> const_code_point_type operator*() const;
> ...
>
> };

I have something broadly like this here:

http://svn.chezphil.org/libpbe/trunk/include/charset/const_character_iterator.hh

I attempted to do this with the character set as a template parameter
and a "charset traits" class providing encoding and decoding
functions. That was probably over-complicated; making it utf-8 only
would be fine - but in that case, it should have a name that says "utf8".

I do find it somewhat unsatisfactory that you need to store the begin
and end of the underlying string. This triples the size of what could
otherwise be a single pointer. I think these are only needed to detect
invalid utf-8, aren't they? In some of my code I had an error_policy
template parameter that allowed you to specify whether the input should
be trusted or not; if it's trusted you can avoid this overhead. Even
then, though, you can't avoid having begin and end in the interface,
adding verbosity.
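
A minimal sketch of the trusted/untrusted idea, under stated assumptions: the names trust_input, check_input and utf8_reader are hypothetical and are not taken from libpbe or from the proposal. The reader is forward-only to keep it short, so the checked policy only has to keep the end pointer; a bidirectional version would also need begin. The checked policy catches truncation and malformed lead or continuation bytes only, not overlong forms or surrogates.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical policy: the input is trusted, so nothing is stored or checked.
struct trust_input {
    trust_input(const char*, const char*) {}
    void check_room(const char*, int) const {}
    void check_lead(unsigned char) const {}
    void check_cont(unsigned char) const {}
};

// Hypothetical policy: keeps the end pointer and throws on malformed input.
struct check_input {
    const char* end_;
    check_input(const char*, const char* end) : end_(end) {}
    void check_room(const char* p, int n) const {
        if (end_ - p < n) throw std::runtime_error("truncated UTF-8");
    }
    void check_lead(unsigned char c) const {
        if ((c >= 0x80 && c < 0xC0) || c >= 0xF8)
            throw std::runtime_error("bad UTF-8 lead byte");
    }
    void check_cont(unsigned char c) const {
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("bad UTF-8 continuation byte");
    }
};

// Forward-only reader. Deriving from the policy lets an empty policy
// take no space (empty base optimisation) on common compilers.
template <typename Policy>
class utf8_reader : private Policy {
    const char* pos_;
public:
    utf8_reader(const char* begin, const char* end)
        : Policy(begin, end), pos_(begin) {}

    const char* base() const { return pos_; }

    // Decode the code point at the current position and step past it.
    char32_t next() {
        this->check_room(pos_, 1);
        unsigned char lead = static_cast<unsigned char>(*pos_);
        this->check_lead(lead);
        int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        this->check_room(pos_, len);
        char32_t cp = len == 1 ? lead : lead & (0x7F >> len);
        for (int i = 1; i < len; ++i) {
            unsigned char c = static_cast<unsigned char>(pos_[i]);
            this->check_cont(c);
            cp = (cp << 6) | (c & 0x3F);
        }
        pos_ += len;
        return cp;
    }
};
```

With the trusted policy the object holds only the current position; with the checked policy it also carries the end pointer. Either way the constructor still takes both arguments, which is the interface verbosity mentioned above.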

Another way to avoid storing begin and end is to somehow make those
iterators empty structs (and hence also default-constructible).
Specifically, if your underlying string is guaranteed to be
null-terminated, the end iterator can be stateless. I guess you could
avoid storing the begin iterator by prepending a null, but that doesn't
work for std::string.
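
As a rough illustration of a stateless end for null-terminated input, the sketch below compares a one-pointer iterator against an empty tag type. The names cstr_end, cstr_iterator and count_code_points are hypothetical, and the iterator walks bytes rather than decoded characters to keep the example small. Note that standard algorithms of this era expect begin and end to have the same type, so a pairing like this only works with hand-written loops.

```cpp
#include <cassert>
#include <cstddef>

// Empty tag type standing in for "end of a null-terminated string".
struct cstr_end {};

// Byte iterator over a null-terminated string; stores one pointer only.
class cstr_iterator {
    const char* p_;
public:
    explicit cstr_iterator(const char* p) : p_(p) {}
    char operator*() const { return *p_; }
    cstr_iterator& operator++() { ++p_; return *this; }

    // Reaching the end means pointing at the terminating null byte.
    friend bool operator==(const cstr_iterator& it, cstr_end) {
        return *it.p_ == '\0';
    }
    friend bool operator!=(const cstr_iterator& it, cstr_end e) {
        return !(it == e);
    }
};

// Count code points in trusted UTF-8 by skipping continuation bytes
// (those of the form 10xxxxxx).
inline std::size_t count_code_points(const char* s) {
    std::size_t n = 0;
    for (cstr_iterator it(s); it != cstr_end(); ++it) {
        if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For example, count_code_points("a\xC3\xA9") visits three bytes but counts two code points, and no end pointer is stored anywhere.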

> /// Output iterator
> template<typename BackInserter>
> class code_point_iterator {
> public:
>
> code_point_iterator(BackInserter out); // begin
> code_point_iterator(); // end
>
> #ifdef C++0x
> typedef char32_t code_point_type;
> #else
> typedef unsigned code_point_type;
> #endif
>
> code_point_type operator*() const;
> ...
>
> };

So this only allows appending, right? I have something like that here:

http://svn.chezphil.org/libpbe/trunk/include/charset/character_output_iterator.hh

Broadly, I would say that allowing bidirectional reading and
append-only writing is the right thing to do for strings. If anyone
has an hour to spare, it's educational to try hacking your code to use
std::list<char> instead of std::string, and see how much of it still compiles.
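
To make "append-only writing" concrete, here is a small sketch of an output iterator that accepts code points and emits UTF-8 bytes through any byte output iterator. The names utf8_output_iterator and utf8_back_inserter are hypothetical and this is not the code at the URL above. It assumes a compiler with char32_t and does no range checking on the code point.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Output iterator: assigning a code point appends its UTF-8 encoding.
template <typename ByteOut>
class utf8_output_iterator {
    ByteOut out_;
    void put(unsigned v) { *out_++ = static_cast<char>(v); }
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef void difference_type;
    typedef void pointer;
    typedef void reference;

    explicit utf8_output_iterator(ByteOut out) : out_(out) {}

    // Dereference and increment are no-ops; the work happens on assignment.
    utf8_output_iterator& operator*() { return *this; }
    utf8_output_iterator& operator++() { return *this; }
    utf8_output_iterator& operator++(int) { return *this; }

    utf8_output_iterator& operator=(char32_t cp) {
        if (cp < 0x80) {
            put(cp);
        } else if (cp < 0x800) {
            put(0xC0 | (cp >> 6));
            put(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            put(0xE0 | (cp >> 12));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        } else {
            put(0xF0 | (cp >> 18));
            put(0x80 | ((cp >> 12) & 0x3F));
            put(0x80 | ((cp >> 6) & 0x3F));
            put(0x80 | (cp & 0x3F));
        }
        return *this;
    }
};

// Convenience helper: append code points to any container of bytes.
template <typename Container>
utf8_output_iterator<std::back_insert_iterator<Container> >
utf8_back_inserter(Container& c) {
    return utf8_output_iterator<std::back_insert_iterator<Container> >(
        std::back_inserter(c));
}
```

Because it only needs push_back on the underlying container, the same iterator would work with std::string, std::vector<char> or std::list<char>.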

> template<typename Char,typename Traits=std::char_traits<Char>,
> typename Alloc=std::allocator<Char> >
> class basic_string {
> public:
> // { boost specific
> typedef std::basic_string<Char,Traits,Alloc> std_string_type;
> // } boost specific
>
> // All std::string standard functions based
>
> // Deprecated interfaces that exist for backward compatibility
> // as they not Unicode aware
>
> value_type &at(size_type indx);
> value_type &operator[](size_type indx);
> iterator begin();
> iterator end();
>
> // { boost specific compatibility functions with std::string, they would go
> // as std::string becode extended with boost::string new interfaces
> //
> basic_string(std_string_type const &other) : data_(other) {}
> basic_string(std_string_type const &other,size_type index,size_type len)
> : data_(other,index,len) {}
>
> ...
>
> operator std_string_type() const
> {
> return data_;
> }
>
> // } boost specific compatibility functions
>
> //
> // Unicode Support
> //
> // ------------------------
> //
>
> //
> // UTF Codepoint iteration
> //
>
> #ifdef C++0x
> typedef char32_t code_point_type;
> #else
> typedef unsigned code_point_type;
> #endif
>
> typedef boost::const_code_point_iterator<const_iterator>
> const_code_point_iterator;
>
> const_code_point_iterator code_point_begin() const
> {
> return const_code_point_iterator(begin(),end());
> }
> const_code_point_iterator code_point_end() const
> {
> return const_code_point_iterator(begin(),end(),end());
> }
>
> typedef boost::code_point_iterator<std::back_inserter<basic_string> >
> code_point_iterator;
>
> code_point_iterator back_inserter()
> {
> return code_point_iterator(std::back_inserter<basic_string>(*this));
> }
>
> basic_string &operator+=(code_point_type code_point);
> basic_string operator+(code_point_type code_point) const;
> void append(code_point_type code_point);

The approach that I would prefer is more like:

template <typename impl_t>
class utf8_string_adaptor {
   impl_t impl;
   // ...
};

typedef utf8_string_adaptor<std::string> utf8_string;

In this way:
- I can wrap containers other than std::string, e.g. sgi::rope, char*,
std::vector etc.
- utf8_string::begin() can return a utf8_character_iterator.
- Accessing the underlying bytes is possible but requires something
explicit e.g. foo.base().begin().
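
The adaptor outlined above could be fleshed out roughly as follows. This is only a sketch under assumptions: the iterator is forward-only, trusts its input, and requires impl_t to provide a const_iterator typedef (so a bare char* would need a small wrapper). Everything beyond the names utf8_string_adaptor, utf8_string, utf8_character_iterator, impl and base() is invented for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Forward-only code point iterator over trusted UTF-8 (no validation).
template <typename ByteIter>
class utf8_character_iterator {
    ByteIter pos_;
    static int length(unsigned char lead) {
        return lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }
public:
    explicit utf8_character_iterator(ByteIter pos) : pos_(pos) {}

    // Explicit access to the underlying byte position.
    ByteIter base() const { return pos_; }

    char32_t operator*() const {
        ByteIter p = pos_;
        unsigned char lead = static_cast<unsigned char>(*p);
        int len = length(lead);
        char32_t cp = len == 1 ? lead : lead & (0x7F >> len);
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(*++p) & 0x3F);
        return cp;
    }
    utf8_character_iterator& operator++() {
        int len = length(static_cast<unsigned char>(*pos_));
        for (int i = 0; i < len; ++i) ++pos_;
        return *this;
    }
    friend bool operator==(const utf8_character_iterator& a,
                           const utf8_character_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_character_iterator& a,
                           const utf8_character_iterator& b) {
        return !(a == b);
    }
};

template <typename impl_t>
class utf8_string_adaptor {
    impl_t impl;
public:
    typedef utf8_character_iterator<typename impl_t::const_iterator>
        const_iterator;

    utf8_string_adaptor() {}
    explicit utf8_string_adaptor(const impl_t& bytes) : impl(bytes) {}

    // Iteration is over characters by default...
    const_iterator begin() const { return const_iterator(impl.begin()); }
    const_iterator end() const { return const_iterator(impl.end()); }

    // ...and the bytes are reachable only through an explicit call.
    const impl_t& base() const { return impl; }
};

typedef utf8_string_adaptor<std::string> utf8_string;
```

With this shape, a loop from foo.begin() to foo.end() sees code points, while foo.base().begin() still gives the raw bytes when they are really wanted.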

> //
> // Lexical operations on string
> //
>
> // Case handling
>
> basic_string upper_case(std::locale const &l=std::locale()) const;
> basic_string lower_case(std::locale const &l=std::locale()) const;
> basic_string title_case(std::locale const &l=std::locale()) const;
> basic_string fold_case() const; // locale independent
>
> // Unicode normalization
>
> typedef enum {
> nfc,
> nfkc,
> nfd,
> nfkd
> } normalization_mode;
>
> basic_string normalize(normalization_mode mode = nfc) const;
>
> // normalized string constructor
>
> basic_string(basic_string const &,normalization_mode mode);
> basic_string(Char const *,normalization_mode mode);
> basic_string(Char const *,size_t n,normalization_mode mode);
> template<Iterator>
> basic_string(Iterator begin,Iterator end,normalization_mode mode);
>
> void append_normalized(basic_string const &other,normalization_mode mode
> = nfc);
> void append_normalized(Char const *,normalization_mode mode = nfc);
> void append_normalized(Char const *,size_t n,normalization_mode mode =
> nfc);
>
> basic_string concat_normalized(basic_string const
> &other,normalization_mode mode = nfc) const;
> basic_string concat_normalized(Char const *,normalization_mode mode =
> nfc) const;
> basic_string concat_normalized(Char const *,size_t n,normalization_mode
> mode = nfc) const;
>
> // Unicode validation
>
> bool valid_utf() const;
>
[snip]

Surely almost all of that should be in free functions and generic
algorithms, no? E.g. valid_utf8() could be an algorithm that takes a
pair of iterators over bytes, and then it can be used on any sequence.
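
For instance, a free-function valid_utf8() over a pair of byte iterators might look something like the sketch below. The signature is an assumption, not part of the proposal. It rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates and values above U+10FFFF.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if [first, last) is well-formed UTF-8.
// Works on any input iterator whose value converts to unsigned char.
template <typename ByteIter>
bool valid_utf8(ByteIter first, ByteIter last) {
    while (first != last) {
        unsigned char lead = static_cast<unsigned char>(*first);
        ++first;
        if (lead < 0x80) continue;           // single-byte (ASCII)

        int extra;                           // continuation bytes expected
        unsigned long cp;                    // code point being assembled
        unsigned long min;                   // smallest value for this length
        if (lead >= 0xC2 && lead < 0xE0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return false;                    // stray continuation or bad lead
        }

        for (; extra > 0; --extra) {
            if (first == last) return false; // truncated sequence
            unsigned char c = static_cast<unsigned char>(*first);
            ++first;
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                       // overlong form
        if (cp > 0x10FFFF) return false;                  // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
    }
    return true;
}
```

Since it is just an algorithm, the same function can be called on a std::string, a std::vector<unsigned char>, or a pair of const char* pointers.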

Regards, Phil.

