Boost :


Subject: Re: [boost] [UTF String] UTF String library 1.5 ready for perusal
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-02-10 18:07:58

Next message: Larry Evans: "Re: [boost] Adding libc++ support to boost."
Previous message: Edward Diener: "Re: [boost] [type_traits] extension has_operator_xxx - cv qualifiers and references"
In reply to: Chad Nelson: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"
Next in thread: Chad Nelson: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"
Reply: Chad Nelson: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"

On 10/02/2011 18:39, Chad Nelson wrote:
> On Thu, 10 Feb 2011 15:18:27 +0100
> Mathias Gaunard<mathias.gaunard_at_[hidden]> wrote:
>
>> On 10/02/2011 14:40, Chad Nelson wrote:
>>
>>>>> This version is substantially better than the original. The design
>>>>> has been somewhat simplified, removing extraneous features like
>>>>> null-string emulation. Each of the classes now contains as many of
>>>>> the std::string functions as I could efficiently add (essentially
>>>>> all of them in utf32_t), including I/O stream functions
>>>>
>>>> Bad design, IMHO.
>>>
>>> Not very constructive. *Why* do you think it's a bad design?
>>
>> It's generally agreed that std::string is a bad design.
>> See GotW #84 for example. That must be a good ten years old...
>
> Maybe so, but irrelevant in this case. The goal was to make
> transitioning from std::string to the UTF types as painless as
> possible, for those who want to do it, and that means duplicating as
> many of std::string's functions as can efficiently be done.
>
>>>>> and also features code-point iterators.
>>>>
>>>> That code point iterator uses pointers and indexes instead of
>>>> iterators, which means it cannot work as an arbitrary iterator
>>>> adaptor even though it could with virtually no change, especially
>>>> since it only requires a forward iterator.
>>>
>>> Sorry, I don't understand the reasoning behind that assertion. Please
>>> enlighten me.
>>
>> There is no need for any reasoning: look at the code of your code
>> point iterator. It uses a pointer and indexes, and is therefore not a
>> generic iterator adaptor.
>
> It wasn't meant to be generic. It was meant to be exactly what it is:
> an iterator specific to the UTF type where it's defined. For that
> purpose, it's designed exactly as it should be, IMHO.

It was already stated on this list that the ability to deal with
arbitrary ranges is more valuable; I am merely stating that your
iterator could work with any iterator with virtually no change.
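
To illustrate what an adaptor over arbitrary iterators could look like, here is a minimal sketch. It is not the code under review: the names `utf8_code_point_iterator` and `decode_all` are hypothetical, the underlying sequence is assumed to be valid UTF-8, and `operator*` returns by value, so the adaptor is declared as an input iterator to stay within the pre-C++20 iterator requirements.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Hypothetical adaptor: yields code points from any iterator over UTF-8
// code units. Assumes the underlying sequence is valid UTF-8.
template <typename Iter>
class utf8_code_point_iterator {
public:
    // Declared as an input iterator because operator* returns by value.
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_code_point_iterator() = default;
    explicit utf8_code_point_iterator(Iter pos) : pos_(pos) {}

    char32_t operator*() const {
        Iter it = pos_;
        const unsigned char lead = static_cast<unsigned char>(*it);
        const int len = sequence_length(lead);
        // Strip the length-prefix bits from the lead code unit.
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (int i = 1; i < len; ++i) {
            ++it;
            cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3Fu);
        }
        return cp;
    }

    utf8_code_point_iterator& operator++() {
        // Only increments are needed, so a forward iterator is enough.
        std::advance(pos_, sequence_length(static_cast<unsigned char>(*pos_)));
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        utf8_code_point_iterator tmp = *this;
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return !(a == b);
    }

private:
    // Sequence length deduced from the lead code unit alone.
    static int sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        return 4;
    }

    Iter pos_{};
};

// Hypothetical helper: decodes a whole container of UTF-8 code units.
template <typename Range>
std::vector<char32_t> decode_all(const Range& bytes) {
    using It = utf8_code_point_iterator<typename Range::const_iterator>;
    return std::vector<char32_t>(It(bytes.begin()), It(bytes.end()));
}
```

The same template works over a `std::string` and over a `std::list<char>`, since nothing in it depends on pointers or contiguous storage.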

> I could make it fully generic, but it wouldn't be nearly as efficient
> that way. I chose to do the extra work to make it efficient.

Your code never uses the fact that the iterator is a pointer or that
memory is stored contiguously.
I also don't think that your UTF-8 iteration strategy, for example, is
very fast. Your UTF-8 decoding itself seems to have a lot of repetition
and unnecessary tests and memory accesses...

Not that it is easy to make that kind of thing fast anyway.

> It could be a bidirectional iterator, as it has all of the abilities of
> one. And it could be a random access iterator, as it has all but one of
> the requirements for that (and in many cases has all of them). Given
> that choice, I chose to make it a random access iterator.

It has constant-time distance, which isn't very useful and adds
unnecessary overhead to your iterator.

>
>> I'm not a fan of returning a reference in operator* as well.
>
> No choice in that, I ran into at least one STL algorithm under GCC that
> wouldn't compile if it wasn't a reference, even when it was only being
> read. I don't remember which one, but it was something important and
> commonly-used enough that breaking it was not an option.

I believe the standard containers indeed require it to be a reference,
but I'm not aware of any problems with any implementation.

Would you mind telling me which libstdc++ algorithm relies on this?

> They give me a way to prevent my iterator code from walking off the
> beginning or end of the underlying string.

> The only other way to do it
> would be to store a pointer to the string object in every iterator, or
> a pair of iterators or pointers to the underlying type, which I
> considered worse.

You seek the next "first" character of a code-unit sequence. That indeed
causes problems when you reach the end (unless you put a 0 at the end of
your buffer, which isn't such a bad idea).
UTF-8 and UTF-16, however, don't require this. You can deduce how many code
units you need to consume from the first code unit. You do that in your
decoders.
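
As a rough sketch of that deduction (the function names are hypothetical, and the input is assumed to be valid and to start on a sequence boundary):

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence introduced by `lead`.
// Assumes `lead` is a valid lead code unit, not a continuation byte.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    return 4;                             // 11110xxx
}

// Length of the UTF-16 sequence introduced by `lead`.
// A high surrogate starts a pair; anything else stands alone.
inline std::size_t utf16_sequence_length(char16_t lead) {
    return (lead >= 0xD800 && lead <= 0xDBFF) ? 2 : 1;
}

// Counts code points by jumping from lead unit to lead unit. On valid
// input the index lands exactly on s.size(), so nothing past the end
// of the buffer is ever read.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size();
         i += utf8_sequence_length(static_cast<unsigned char>(s[i]))) {
        ++n;
    }
    return n;
}
```

Because each step is computed from the current lead unit, there is no scan for the next lead unit and therefore no need for a sentinel or a stored end position when the data is known to be valid.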

> As an important side benefit, they also provide an
> efficient way to calculate the difference in code points for operator-,
> which I feel is important.
>
> And the size in code-points is (supposed to be) stored at all times.

What's the point of this?
The size in code points is not a very useful thing in general.

> All the UTF types were very carefully designed so that there's no
> chance of invalid data in them, barring extraordinary measures to
> deliberately corrupt it.

That's a good thing.
However, you could use this opportunity to make decoding much faster,
since you don't need to check for correctness anymore.
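
One way to picture this split, as a sketch only: validate once where data enters the string type, then decode without any checks afterwards. `is_valid_utf8` and `decode_unchecked` are hypothetical names, not functions from the library under discussion.

```cpp
#include <cstddef>
#include <string>

// Full validation, done once at the boundary (e.g. on construction).
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp, min;
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;                       // stray continuation or bad lead
        if (n - i < len) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;               // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}

// Decodes the code point starting at s[i] and advances i past it.
// Precondition: s is valid UTF-8 and i is on a sequence boundary,
// so no range, continuation, or overlong checks are performed.
inline char32_t decode_unchecked(const std::string& s, std::size_t& i) {
    const auto byte = [&s](std::size_t k) -> char32_t {
        return static_cast<unsigned char>(s[k]);
    };
    const char32_t b0 = byte(i);
    if (b0 < 0x80) { i += 1; return b0; }
    if (b0 < 0xE0) {
        const char32_t cp = ((b0 & 0x1F) << 6) | (byte(i + 1) & 0x3F);
        i += 2;
        return cp;
    }
    if (b0 < 0xF0) {
        const char32_t cp = ((b0 & 0x0F) << 12) | ((byte(i + 1) & 0x3F) << 6) |
                            (byte(i + 2) & 0x3F);
        i += 3;
        return cp;
    }
    const char32_t cp = ((b0 & 0x07) << 18) | ((byte(i + 1) & 0x3F) << 12) |
                        ((byte(i + 2) & 0x3F) << 6) | (byte(i + 3) & 0x3F);
    i += 4;
    return cp;
}
```

The unchecked path has one branch per sequence length and no error handling, which is only safe because the class invariant guarantees valid contents.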

> Anything else? :-)

You would probably encounter problems on platforms where int is 8 or 16
bits.
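
A small sketch of the kind of precaution meant here, assuming a hypothetical `code_point` typedef and `combine_surrogates` helper (neither is taken from the library): do the arithmetic in a type guaranteed to hold at least 32 bits instead of relying on the width of int.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical code point type: at least 32 bits on every platform,
// whatever the width of int.
typedef std::uint_least32_t code_point;

static_assert(sizeof(code_point) * CHAR_BIT >= 21,
              "code_point must be able to hold U+10FFFF");

// Combines a UTF-16 surrogate pair into a code point.
// Assumes `high` is in [0xD800, 0xDBFF] and `low` is in [0xDC00, 0xDFFF].
inline code_point combine_surrogates(std::uint_least16_t high,
                                     std::uint_least16_t low) {
    // Cast before shifting: shifting a value of a 16-bit int type left
    // by 10 would discard the upper bits.
    return 0x10000u +
           ((static_cast<code_point>(high) - 0xD800u) << 10) +
           (static_cast<code_point>(low) - 0xDC00u);
}
```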



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk