Subject: Re: [boost] [UTF String] UTF String library 1.5 ready for perusal
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-02-11 05:44:03


On 11/02/2011 01:52, Chad Nelson wrote:
>> It was already stated on this list that the ability to deal with
>> arbitrary ranges is more valuable; I am merely stating that your
>> iterator could work with any iterator with virtually no change.
>
> Either I missed that statement, or I don't recognize it in this
> context, because I'm not sure what you mean.

Which statement are you referring to?
The piece you quoted contains two different statements.

>
>>> I could make it fully generic, but it wouldn't be nearly as efficient
>>> that way. I chose to do the extra work to make it efficient.
>>
>> Your code never uses the fact that the iterator is a pointer or that
>> memory is stored contiguously.
>
> How would you suggest that it use that information?

You said it was more efficient to only work with pointers than to work
with arbitrary iterators.

Again, I'm merely stating that you are never relying on the fact that
your pointer is a pointer, so your code could very well work with any
iterator.
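
To illustrate the point, here is a minimal sketch (not code from the library under review) of a UTF-8 decoding step that uses only dereference and increment, so it accepts a raw pointer and a std::list iterator alike. The name decode_utf8 is hypothetical, and the sketch assumes the input is already well-formed UTF-8.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>

// Hypothetical helper: decode one code point from well-formed UTF-8 and
// advance `it` past it. Only *it and ++it are used, so any input iterator
// over char-sized code units works. No validation is performed.
template <typename InputIt>
std::uint_least32_t decode_utf8(InputIt& it)
{
    const unsigned char lead = static_cast<unsigned char>(*it);
    ++it;
    if (lead < 0x80) return lead;  // single-byte (ASCII) sequence

    // Number of continuation bytes implied by the lead byte.
    int trailing = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;

    // Keep only the payload bits of the lead byte (5, 4 or 3 bits).
    std::uint_least32_t cp = lead & (0x3F >> trailing);
    for (; trailing > 0; --trailing) {
        cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3F);
        ++it;
    }
    return cp;
}

// Usage: the same function template serves a pointer and a list iterator.
inline void usage_example_decode()
{
    const std::string euro = "\xE2\x82\xAC";  // U+20AC
    const char* p = euro.c_str();
    assert(decode_utf8(p) == 0x20ACu);

    std::list<char> chars(euro.begin(), euro.end());
    std::list<char>::iterator it = chars.begin();
    assert(decode_utf8(it) == 0x20ACu);
}
```

Nothing in the function body depends on contiguous storage, which is why restricting it to pointers would buy nothing here.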

> I think we have different definitions of generic. As I said in my last
> message, I define fully generic as working with any UTF-encoded string
> (not just UTF-8). That would be possible, but would almost certainly be
> less processor-efficient than having iterators customized to each type.

I don't see what that has to do with genericity, nor how that would
incur any runtime overhead.
Just dispatch (through template specialization or overloading) to
different iterator types depending on the size of the value type of the
underlying data range.
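
A rough sketch of such a dispatch follows. The names utf_codec and codec_for are hypothetical, the codec bodies are reduced to placeholders, and the mapping from sizeof to encoding assumes 8-bit bytes. The selection happens entirely at compile time, so it adds no runtime cost.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical codec family, selected by code unit size in bytes.
// Assumes 8-bit bytes, so sizes 1, 2 and 4 mean UTF-8, UTF-16 and UTF-32.
// The primary template is left undefined: other sizes fail to compile.
template <std::size_t UnitSize> struct utf_codec;

template <> struct utf_codec<1> {
    static const char* name() { return "UTF-8"; }
    static int max_units() { return 4; }  // code units per code point, at most
};
template <> struct utf_codec<2> {
    static const char* name() { return "UTF-16"; }
    static int max_units() { return 2; }
};
template <> struct utf_codec<4> {
    static const char* name() { return "UTF-32"; }
    static int max_units() { return 1; }
};

// Pick the codec from the iterator's value type, at compile time.
template <typename Iterator>
struct codec_for {
    typedef typename std::iterator_traits<Iterator>::value_type unit_type;
    typedef utf_codec<sizeof(unit_type)> type;
};

// Usage: different underlying ranges select different codecs.
inline void usage_example_dispatch()
{
    typedef codec_for<std::string::const_iterator>::type narrow_codec;
    typedef codec_for<const char32_t*>::type wide_codec;
    assert(narrow_codec::max_units() == 4);
    assert(wide_codec::max_units() == 1);
}
```

A real implementation would put the decoding iterator itself in each specialization; the placeholders above only show the selection mechanism.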

> There's always room for further optimization, at the cost of more
> programmer time and more code.

Well, a tool for encoding/decoding UTF strings is only as good as:
  - the quality of its codec implementations
  - how flexible those codecs are to be able to work with the user's data

I'm afraid your library is not particularly good on the first point, and
is rather bad on the second.

>> It has constant-time distance, which isn't very useful and adds
>> unnecessary overhead to your iterator.
>
> I disagree. I find it extremely useful to have a true random access
> iterator for some strings (most, in UTF-16, and arguably most in many
> cases of UTF-8 too), and an emulated one for the rest. And for that,
> the overhead isn't unnecessary.

Constant-time distance does not give you random access, so I don't know
what you're getting at.

If you want pseudo-random access, use std::advance.
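
For instance, a small helper along these lines (the name nth is hypothetical) gives indexed access over any iterator category. std::advance takes constant time on random access iterators and linear time on the others.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Hypothetical helper: return the iterator n steps after `first`.
// std::advance is O(1) for random access iterators and O(n) otherwise.
// The caller must ensure that n steps stay inside the range.
template <typename Iterator>
Iterator nth(Iterator first,
             typename std::iterator_traits<Iterator>::difference_type n)
{
    std::advance(first, n);
    return first;
}

// Usage: the same call works on a vector and on a list.
inline void usage_example_nth()
{
    std::vector<int> v = {10, 20, 30, 40};
    std::list<int> l(v.begin(), v.end());
    assert(*nth(v.begin(), 2) == 30);  // constant time
    assert(*nth(l.begin(), 2) == 30);  // linear time
}
```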

> If I recalled which one it was, I would have put it in the original
> message.

So you base your claims on vague memories, I see.
I also see your iterators are missing operator->. It is usually better
to use boost::iterator_facade or boost::iterator_adaptor to define
iterators.

> And I didn't want to duplicate that code in the iterators as well, or
> separate it out and possibly add the overhead of another function call
> to the decoder functions.

The fact that you would need to duplicate this is proof of how inflexible
your design is.

So you're adding overhead with meaningless data and essentially
computing redundant things because you can't restructure your code to do
it the right way. Interesting.

> On the contrary, keeping track of the length of the string is *very*
> useful. The alternative is to calculate it on the fly, whenever someone
> asks for it. If you want that, you can always go back to using C-style
> strings.

Again, extracting the size in code points of a string is not a
particularly useful operation, so I don't really see the point of
maintaining it.

Do you have real examples of where it is useful to have that operation
be O(1) instead of O(n)?

>> That's a good thing.
>> However, you could use this opportunity to make decoding much faster,
>> since you don't need to check for correctness anymore.
>
> As I said, there's always room for further optimization.

Well, the whole point of enforcing validity is to make use of it; since
you don't make use of it, your design doesn't really demonstrate that it can.
I'm afraid doing it with your design would also require a lot of code
duplication.
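
As one illustration of what guaranteed validity allows (a sketch under that assumption, with a hypothetical function name): counting code points in well-formed UTF-8 needs no decoding or checking at all, only a test for whether each byte is a continuation byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in [first, last).
// Assumes the range is already known to be well-formed UTF-8, so every
// byte that is not a continuation byte (10xxxxxx) starts a code point.
// On invalid input the result is meaningless.
template <typename InputIt>
std::size_t count_code_points(InputIt first, InputIt last)
{
    std::size_t n = 0;
    for (; first != last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Usage: 'a' (1 byte), U+00E9 (2 bytes) and U+20AC (3 bytes).
inline void usage_example_count()
{
    const std::string s = "a\xC3\xA9\xE2\x82\xAC";
    assert(s.size() == 6);
    assert(count_code_points(s.begin(), s.end()) == 3);
}
```

A checked version would have to verify lead bytes, sequence lengths, overlong forms and surrogates on every pass, which is exactly the work that enforced validity makes unnecessary.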

>> You would probably encounter problems on platforms where int is 8 or
>> 16 bits.
>
> I haven't seen a platform where an int is 16 bits since DOS, which I
> stopped coding for in the late nineties. And I've never seen one where
> it's eight bits. Do you know of any modern platform -- as in one that
> uses Unicode, and could usefully use this library -- where that's the
> case?

If you want to be included in Boost, it is good practice not to rely on
non-portable assumptions when there is absolutely no need for them and
nothing to gain from them.

As far as I know, only DSPs these days have such properties, and it does
seem unlikely one would want to use these for Unicode text processing.
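
Avoiding the assumption costs very little. A sketch, with hypothetical names: store code points in a type that is guaranteed to be wide enough rather than in int.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical typedef: uint_least32_t has at least 32 bits on every
// conforming platform, whereas int is only guaranteed to have 16.
typedef std::uint_least32_t code_point;

static_assert(std::numeric_limits<code_point>::max() >= 0x10FFFF,
              "code_point must be able to hold every Unicode code point");

// True for Unicode scalar values: at most U+10FFFF and not a surrogate.
inline bool is_scalar_value(code_point cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Usage: the check behaves the same whatever the width of int is.
inline void usage_example_code_point()
{
    assert(is_scalar_value(0x20AC));     // euro sign
    assert(!is_scalar_value(0xD800));    // surrogate
    assert(!is_scalar_value(0x110000));  // beyond the Unicode range
}
```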

