Subject: Re: [boost] [UTF String] UTF String library 1.5 ready for perusal
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-02-10 18:07:58
On 10/02/2011 18:39, Chad Nelson wrote:
> On Thu, 10 Feb 2011 15:18:27 +0100
> Mathias Gaunard<mathias.gaunard_at_[hidden]> wrote:
>> On 10/02/2011 14:40, Chad Nelson wrote:
>>>>> This version is substantially better than the original. The design
>>>>> has been somewhat simplified, removing extraneous features like
>>>>> null-string emulation. Each of the classes now contain as many of
>>>>> the std::string functions as I could efficiently add (essentially
>>>>> all of them in utf32_t), including I/O stream functions
>>>> Bad design, IMHO.
>>> Not very constructive. *Why* do you think it's a bad design?
>> It's generally agreed on that std::string is a bad design.
>> See GotW #84 for example. That must be a good ten years old...
> Maybe so, but irrelevant in this case. The goal was to make
> transitioning from std::string to the UTF types as painless as
> possible, for those who want to do it, and that means duplicating as
> many of std::string's functions as can efficiently be done.
>>>>> and also features code-point iterators.
>>>> That code point iterator uses pointers and indexes instead of
>>>> iterators, which means it cannot work as an arbitrary iterator
>>>> adaptor even though it could with virtually no change, especially
>>>> since it only requires a forward iterator.
>>> Sorry, I don't understand the reasoning behind that assertion. Please
>>> enlighten me.
>> There is no need for any reasoning: look at the code of your code
>> point iterator. It uses a pointer and indexes, and is therefore not a
>> generic iterator adaptor.
> It wasn't meant to be generic. It was meant to be exactly what it is:
> an iterator specific to the UTF type where it's defined. For that
> purpose, it's designed exactly as it should be, IMHO.
It was already stated on this list that the ability to deal with
arbitrary ranges is more valuable; I am merely stating that your
iterator could work with any iterator with virtually no change.
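
To illustrate, here is a minimal sketch of a code point iterator written as an adaptor over an arbitrary forward iterator. It is not the library's iterator: the names code_point_iterator and utf8_length are hypothetical, it assumes C++11 (for char32_t) and input that is already valid UTF-8, and it only covers forward traversal.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Length of a UTF-8 sequence, deduced from its lead byte.
// Assumes the byte really is a valid lead byte.
inline int utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

// Hypothetical adaptor: walks code points over any forward iterator whose
// value type is a byte-sized character. No pointers or indexes are needed.
template <typename ForwardIt>
class code_point_iterator {
public:
    // Tagged as an input iterator because operator* returns by value; the
    // underlying iterator must still be multi-pass, since operator* re-reads.
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t                value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const char32_t*         pointer;
    typedef char32_t                reference;

    code_point_iterator() : pos_() {}
    explicit code_point_iterator(ForwardIt pos) : pos_(pos) {}

    // Decode the code point starting at the current position.
    char32_t operator*() const {
        ForwardIt it = pos_;
        const unsigned char lead = static_cast<unsigned char>(*it);
        const int len = utf8_length(lead);
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (int i = 1; i < len; ++i) {
            ++it;
            cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3Fu);
        }
        return cp;
    }

    // Skip the whole code-unit sequence of the current code point.
    code_point_iterator& operator++() {
        std::advance(pos_, utf8_length(static_cast<unsigned char>(*pos_)));
        return *this;
    }
    code_point_iterator operator++(int) {
        code_point_iterator tmp(*this);
        ++*this;
        return tmp;
    }

    ForwardIt base() const { return pos_; }

    friend bool operator==(const code_point_iterator& a,
                           const code_point_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const code_point_iterator& a,
                           const code_point_iterator& b) {
        return !(a == b);
    }

private:
    ForwardIt pos_;
};

// Usage: count code points stored in a non-contiguous container.
inline std::size_t demo_count_code_points() {
    const char bytes[] = { 'a', '\xE2', '\x82', '\xAC' };  // "a" + U+20AC
    std::list<char> text(bytes, bytes + 4);
    typedef code_point_iterator<std::list<char>::iterator> cp_it;
    const std::size_t n = static_cast<std::size_t>(
        std::distance(cp_it(text.begin()), cp_it(text.end())));
    assert(n == 2);
    return n;
}
```

The same template works unchanged over a pointer, a std::string iterator, or a std::list iterator, which is the point being made about arbitrary ranges.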
> I could make it fully generic, but it wouldn't be nearly as efficient
> that way. I chose to do the extra work to make it efficient.
Your code never uses the fact that the iterator is a pointer or that
memory is stored contiguously.
I also don't think that your UTF-8 iterating strategy, for example, is
very fast. Your UTF-8 decoding itself seems to have a lot of repetition,
unnecessary tests, and memory accesses...
Not that it is easy to make that kind of thing fast anyway.
> It could be a bidirectional iterator, as it has all of the abilities of
> one. And it could be a random access iterator, as it has all but one of
> the requirements for that (and in many cases has all of them). Given
> that choice, I chose to make it a random access iterator.
It has constant-time distance, which isn't very useful and adds
unnecessary overhead to your iterator.
>> I'm not a fan of returning a reference in operator* as well.
> No choice in that, I ran into at least one STL algorithm under GCC that
> wouldn't compile if it wasn't a reference, even when it was only being
> read. I don't remember which one, but it was something important and
> commonly-used enough that breaking it was not an option.
I believe the standard containers indeed require it to be a reference,
but I'm not aware of any problems with any implementation.
Would you mind telling me what libstdc++ algorithm relies on this?
> They give me a way to prevent my iterator code from walking off the
> beginning or end of the underlying string.
> The only other way to do it
> would be to store a pointer to the string object in every iterator, or
> a pair of iterators or pointers to the underlying type, which I
> considered worse.
You seek the next "first" character of a code-unit sequence. That indeed
causes problems when you reach the end (unless you put a 0 at the end of
your buffer, which isn't such a bad idea).
UTF-8 and UTF-16, however, don't require this. You can deduce how many code
units you need to consume from the first code unit. You do that in your
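
The deduction from the first code unit can be sketched as follows. This is only an illustration, assuming well-formed input; the function names are hypothetical and do not come from the library.

```cpp
#include <cassert>
#include <cstdint>

// Number of code units in a UTF-8 sequence, from its first code unit.
// Assumes `lead` is a valid lead byte (not a continuation byte).
inline int utf8_sequence_length(std::uint8_t lead) {
    if (lead < 0x80) return 1;  // 0xxxxxxx
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    return 4;                   // 11110xxx
}

// Number of code units in a UTF-16 sequence, from its first code unit.
// A high surrogate (0xD800..0xDBFF) starts a two-unit pair.
inline int utf16_sequence_length(std::uint16_t first) {
    return (first >= 0xD800 && first <= 0xDBFF) ? 2 : 1;
}

// Advance by one code point without scanning for the next lead byte,
// so nothing past the end of the current sequence is ever read.
inline const std::uint8_t* next_code_point(const std::uint8_t* p) {
    return p + utf8_sequence_length(*p);
}

// Usage: step over "a" and then U+20AC in a four-byte buffer.
inline void demo_next_code_point() {
    const std::uint8_t buf[] = { 0x61, 0xE2, 0x82, 0xAC };
    const std::uint8_t* p = buf;
    p = next_code_point(p);
    assert(p == buf + 1);
    p = next_code_point(p);
    assert(p == buf + 4);  // exactly one past the end, no overrun
}
```

Because the step size is known before moving, the iterator lands exactly on the end position and needs no begin/end bookkeeping to avoid walking off the buffer when incrementing.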
> As an important side benefit, they also provide an
> efficient way to calculate the difference in code points for operator-,
> which I feel is important.
> And the size in code-points is (supposed to be) stored at all times.
What's the point of this?
The size in code points is not a very useful thing in general.
> All the UTF types were very carefully designed so that there's no
> chance of invalid data in them, barring extraordinary measures to
> deliberately corrupt it.
That's a good thing.
However, you could use this opportunity to make decoding much faster,
since you don't need to check for correctness anymore.
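
As a rough sketch of the difference, compare a validating decoder with one that trusts its input. Both functions are hypothetical and assume C++11; neither is taken from the library.

```cpp
#include <cassert>

// Validating decoder: checks bounds, continuation bytes, overlong forms,
// surrogates, and the U+10FFFF limit. Returns false on any error.
inline bool decode_checked(const unsigned char* p, const unsigned char* end,
                           char32_t& out) {
    if (p == end) return false;
    const unsigned char b0 = *p;
    int len;
    char32_t cp, min;
    if (b0 < 0x80) { out = b0; return true; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return false;
    if (end - p < len) return false;
    for (int i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
    out = cp;
    return true;
}

// Unchecked decoder: only valid when the container guarantees well-formed
// UTF-8, in which case every test above is redundant.
inline char32_t decode_unchecked(const unsigned char* p) {
    const unsigned char b0 = p[0];
    if (b0 < 0x80) return b0;
    if (b0 < 0xE0)
        return (static_cast<char32_t>(b0 & 0x1F) << 6) | (p[1] & 0x3F);
    if (b0 < 0xF0)
        return (static_cast<char32_t>(b0 & 0x0F) << 12) |
               (static_cast<char32_t>(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    return (static_cast<char32_t>(b0 & 0x07) << 18) |
           (static_cast<char32_t>(p[1] & 0x3F) << 12) |
           (static_cast<char32_t>(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}

// Usage: both decoders agree on valid input.
inline void demo_decode() {
    const unsigned char euro[] = { 0xE2, 0x82, 0xAC };  // U+20AC
    char32_t cp = 0;
    assert(decode_checked(euro, euro + 3, cp));
    assert(cp == decode_unchecked(euro));
}
```

The checked version would run once, when data enters the string; every later read could then use the unchecked one. The casts to char32_t before shifting keep the arithmetic at 32 bits or wider regardless of the width of int.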
> Anything else? :-)
You would probably encounter problems on platforms where int is 8 or 16