Subject: Re: [boost] [UTF String] UTF String library 1.5 ready for perusal
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-02-10 19:52:54

Next message: Daniel Larimer: "Re: [boost] [type_traits] extension has_operator_xxx - cv qualifiers and references"
Previous message: Chad Nelson: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"
In reply to: Mathias Gaunard: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"
Next in thread: Scott McMurray: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"
Reply: Scott McMurray: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"
Reply: Mathias Gaunard: "Re: [boost] [UTF String] UTF String library 1.5 ready for perusal"

On Fri, 11 Feb 2011 00:07:58 +0100
Mathias Gaunard <mathias.gaunard_at_[hidden]> wrote:

>>> There is no need for any reasoning: look at the code of your code
>>> point iterator. It uses a pointer and indexes, and is therefore not
>>> a generic iterator adaptor.
>>
>> It wasn't meant to be generic. It was meant to be exactly what it is:
>> an iterator specific to the UTF type where it's defined. For that
>> purpose, it's designed exactly as it should be, IMHO.
>
> It was already stated on this list that the ability to deal with
> arbitrary ranges is more valuable; I am merely stating that your
> iterator could work with any iterator with virtually no change.

Either I missed that statement, or I don't recognize it in this
context, because I'm not sure what you mean.

>> I could make it fully generic, but it wouldn't be nearly as efficient
>> that way. I chose to do the extra work to make it efficient.
>
> Your code never uses the fact that the iterator is a pointer or that
> memory is stored contiguously.

How would you suggest that it use that information?

I think we have different definitions of generic. As I said in my last
message, I define fully generic as working with any UTF-encoded string
(not just UTF-8). That would be possible, but would almost certainly be
less processor-efficient than having iterators customized to each type.

> I also don't think that for example your utf-8 iterating strategy is
> very fast. Your utf-8 decoding itself seems to have lots of
> repetition and unnecessary tests and memory accesses...
>
> Not that it is easy to make that kind of thing fast anyway.

There's always room for further optimization, at the cost of more
programmer time and more code.

>> It could be a bidirectional iterator, as it has all of the abilities
>> of one. And it could be a random access iterator, as it has all but
>> one of the requirements for that (and in many cases has all of
>> them). Given that choice, I chose to make it a random access
>> iterator.
>
> It has constant-time distance, which isn't very useful and adds
> unnecessary overhead to your iterator.

I disagree. I find it extremely useful to have a true random access
iterator for some strings (most UTF-16 strings, and arguably most UTF-8
strings in many use cases), and an emulated one for the rest. For that
purpose, the overhead is necessary.
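
For illustration, here is a minimal sketch of how "true random access for
some strings, emulated for the rest" could look for UTF-8. It assumes the
string keeps its code-point count next to its code units; the helper name
byte_offset_of and its signature are hypothetical and are not taken from
the library under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the code point at position `index`
// in a valid UTF-8 string whose code-point count is already known.
inline std::size_t byte_offset_of(const std::string& utf8,
                                  std::size_t code_points,
                                  std::size_t index)
{
    assert(index <= code_points);

    // Fast path: as many code units as code points means every code point
    // is a single byte, so the index is the offset (true random access).
    if (utf8.size() == code_points)
        return index;

    // Slow path: emulate random access by stepping over `index` code
    // points, skipping continuation bytes (bit pattern 10xxxxxx).
    std::size_t offset = 0;
    while (index > 0) {
        ++offset;
        while (offset < utf8.size() &&
               (static_cast<unsigned char>(utf8[offset]) & 0xC0) == 0x80)
            ++offset;
        --index;
    }
    return offset;
}
```

For an all-ASCII string the lookup is constant time; only strings that
contain multi-byte sequences pay for the linear walk.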

>>> I'm not a fan of returning a reference in operator* as well.
>>
>> No choice in that, I ran into at least one STL algorithm under GCC
>> that wouldn't compile if it wasn't a reference, even when it was
>> only being read. I don't remember which one, but it was something
>> important and commonly-used enough that breaking it was not an
>> option.
>
> I believe the standard containers indeed require it to be a
> reference, but I'm not aware of any problems with any implementation.
>
> Would you mind telling me what libstdc++ algorithm relies on this?

If I recalled which one it was, I would have put it in the original
message.

>> They give me a way to prevent my iterator code from walking off the
>> beginning or end of the underlying string.
>>
>> The only other way to do it would be to store a pointer to the
>> string object in every iterator, or a pair of iterators or pointers
>> to the underlying type, which I considered worse.
>
> You seek the next "first" character of a code-unit sequence. That
> indeed causes problems when you reach the end (unless you put a 0 at
> the end of your buffer, which isn't such a bad idea).
> UTF-8 and 16, however, don't require this. You can deduce how many
> code units you need to consume from the first code unit. You do that
> in your decoders.

And I didn't want to duplicate that code in the iterators as well, or
separate it out and possibly add the overhead of another function call
to the decoder functions.
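
As a point of reference for the suggestion quoted above, deducing the
sequence length from the first code unit can be sketched as follows for
UTF-8. The function names are hypothetical and not part of the library;
the byte ranges are the lead-byte ranges of well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in the UTF-8 sequence that
// starts with `lead`, or 0 if `lead` cannot start a sequence.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)                  return 1;  // 0xxxxxxx (ASCII)
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
    return 0;  // continuation byte, or 0xC0, 0xC1, 0xF5..0xFF
}

// Steps over one code point without searching for the next lead byte.
// Assumes `p` points at the first code unit of a valid sequence.
inline const char* utf8_next(const char* p)
{
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(*p));
    assert(n != 0);
    return p + n;
}
```

Because the step size comes from the lead byte alone, an iterator built
this way never has to look past the end of the buffer to find where the
current code point stops.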

>> As an important side benefit, they also provide an efficient way to
>> calculate the difference in code points for operator-, which I feel
>> is important.
>>
>> And the size in code-points is (supposed to be) stored at all times.
>
> What's the point of this?
> The size in code points is not a very useful thing in general.

On the contrary, keeping track of the length of the string is *very*
useful. The alternative is to calculate it on the fly, whenever someone
asks for it. If you want that, you can always go back to using C-style
strings.
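
To make the trade-off concrete, this is roughly what "calculating it on
the fly" costs for UTF-8: one pass over every code unit per query. The
function name is hypothetical, and the sketch assumes the input is
already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in valid UTF-8 by counting the
// bytes that are not continuation bytes (10xxxxxx). Runs in O(n).
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

A string type that stores the result once and updates it on modification
answers the same question in constant time, at the cost of one extra
integer per string.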

>> All the UTF types were very carefully designed so that there's no
>> chance of invalid data in them, barring extraordinary measures to
>> deliberately corrupt it.
>
> That's a good thing.
> However, you could use this opportunity to make decoding much faster,
> since you don't need to check for correctness anymore.

As I said, there's always room for further optimization.
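
A sketch of what such an unchecked decoder might look like for UTF-8,
assuming the input is guaranteed valid. The name decode_unchecked is
hypothetical and this is not the library's decoder; it only shows which
tests disappear once validity is a precondition.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decodes one code point from valid UTF-8 and
// advances `p` past it. Performs no validation of continuation bytes,
// overlong forms, or surrogates; the caller must guarantee validity.
inline std::uint_least32_t decode_unchecked(const unsigned char*& p)
{
    typedef std::uint_least32_t u32;
    assert(p != 0);

    u32 c = *p++;
    if (c < 0x80)
        return c;                                         // 1 code unit

    u32 b1 = *p++ & 0x3Fu;
    if (c < 0xE0)
        return ((c & 0x1F) << 6) | b1;                    // 2 code units

    u32 b2 = *p++ & 0x3Fu;
    if (c < 0xF0)
        return ((c & 0x0F) << 12) | (b1 << 6) | b2;       // 3 code units

    u32 b3 = *p++ & 0x3Fu;
    return ((c & 0x07) << 18) | (b1 << 12) | (b2 << 6) | b3;  // 4 units
}
```

The only branches left are the ones that select the sequence length.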

>> Anything else? :-)
>
> You would probably encounter problems on platforms where int is 8 or
> 16 bits.

I haven't seen a platform where an int is 16 bits since DOS, which I
stopped coding for in the late nineties. And I've never seen one where
it's eight bits. Do you know of any modern platform -- as in one that
uses Unicode, and could usefully use this library -- where that's the
case?
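
For context on the width concern: a Unicode code point needs 21 bits,
while the language only guarantees 16 bits for int. A sketch of how code
that stores code points could avoid depending on the width of int is
shown below; the typedef name code_point and the helper are hypothetical,
not the library's.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical typedef: a type guaranteed to hold at least 32 bits,
// regardless of how wide int is on the target platform.
typedef std::uint_least32_t code_point;

// Compile-time check that every code point up to U+10FFFF fits.
static_assert(sizeof(code_point) * CHAR_BIT >= 21,
              "code_point must hold at least 21 bits");

// True for Unicode scalar values: U+0000..U+10FFFF minus the surrogate
// range U+D800..U+DFFF.
inline bool is_scalar_value(code_point c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}
```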

-- 
Chad Nelson
Oak Circle Software, Inc.