Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-08-09 10:54:08


Soares Chen Ruo Fei wrote:
> A while ago I gave some previews of my Unicode String Adapter library
> to the boost community but I didn't receive much feedback. Now that
> GSoC is ending I'd like you all to take a look at my project again and
> provide feedback on the usefulness of the library. Following are the
> links to my project repository and documentation:
>
> GitHub repository: https://github.com/crf00/boost.ustr
> Documentation: http://crf.scriptmatrix.net/ustr/index.html

I think there are probably as many ways to implement a "better" string
as there are potential users, and previous long discussions here have
considered those possibilities at great length. In summary your
proposal is for a string that is:

- Immutable.
- Reference counted.
- Iterated by default over Unicode code points.
- Accessible as code units via operator* and operator->, i.e.

     s.begin() // Returns a code point iterator.
     s->begin() // Returns a code unit iterator.

I won't comment on the merits or otherwise of those points, apart
from the last, where I'll note that it is not to my taste. It looks
"over-clever" to me. Imagine that I wrote some code using your
library, and then a colleague who was not familiar with it had to look
at it later. Would they have any idea about the difference between
those two cases? No, not unless I added a comment every time I used
it. Please let's have an obvious syntax like:

     s.begin() // Code points.
     s.impl.begin() // Code units.
  or s.units_begin() // Code units.

Personally, I don't want a new clever string class. What I want is a
few well-written building-blocks for Unicode. For example, I'd like to
be able to iterate over the code points in a block of UTF-8 data in raw
memory, so some sort of iterator adaptor is needed. Your library does
have this functionality, but it is hidden in an implementation detail.
Please can you consider bringing out your core UTF encoding and
decoding functions to the public interface?
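
To illustrate the kind of building block I mean, here is a minimal
sketch of a code point iterator over UTF-8 data in raw memory. The
names utf8_code_point_iterator and decode_utf8 are hypothetical and are
not taken from Boost.Ustr. The sketch assumes well-formed input and
does no validation, so it is not a complete decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical name, not part of Boost.Ustr or any Boost library.
// Input iterator that decodes UTF-8 code units in raw memory into
// code points. Assumes well-formed UTF-8; no validation is done.
class utf8_code_point_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf8_code_point_iterator(const unsigned char* p) : p_(p) {}

    // Decode the code point that starts at the current position.
    char32_t operator*() const {
        const unsigned char b0 = p_[0];
        if (b0 < 0x80) return static_cast<char32_t>(b0);
        if (b0 < 0xE0)
            return static_cast<char32_t>(((b0 & 0x1F) << 6) | (p_[1] & 0x3F));
        if (b0 < 0xF0)
            return static_cast<char32_t>(((b0 & 0x0F) << 12) |
                                         ((p_[1] & 0x3F) << 6) |
                                         (p_[2] & 0x3F));
        return static_cast<char32_t>(((b0 & 0x07) << 18) |
                                     ((p_[1] & 0x3F) << 12) |
                                     ((p_[2] & 0x3F) << 6) |
                                     (p_[3] & 0x3F));
    }

    // Advance by the length of the current encoded sequence.
    utf8_code_point_iterator& operator++() {
        p_ += sequence_length(*p_);
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        utf8_code_point_iterator old(*this);
        ++*this;
        return old;
    }

    // Access to the underlying code unit position.
    const unsigned char* base() const { return p_; }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.p_ == b.p_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.p_ != b.p_;
    }

private:
    // Sequence length implied by the lead byte (well-formed input only).
    static std::size_t sequence_length(unsigned char b0) {
        if (b0 < 0x80) return 1;
        if (b0 < 0xE0) return 2;
        if (b0 < 0xF0) return 3;
        return 4;
    }

    const unsigned char* p_;
};

// Hypothetical helper: decode a whole block of raw memory.
inline std::vector<char32_t> decode_utf8(const void* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    const unsigned char* first = static_cast<const unsigned char*>(data);
    utf8_code_point_iterator it(first), end(first + size);
    return std::vector<char32_t>(it, end);
}
```

The point is that the iterator needs only a pointer into the buffer, so
it can be used on memory-mapped files or network buffers without first
constructing any string object.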

I would also like to see some benchmarks for the core UTF conversion
functions. If you post some benchmarks that decouple the UTF
conversion from the rest of the string class, I will compare the
performance with my own code.
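
By "decoupled" I mean something along the lines of the following
sketch, which times a decode function directly over a byte buffer with
no string class involved. Everything here is an assumption made for
illustration: run_benchmark, bench_result and count_code_points are
hypothetical names, and count_code_points is only a stand-in for
whatever conversion function is actually under test.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the function under test: counts code points
// in well-formed UTF-8 by counting bytes that are not continuation bytes.
inline std::size_t count_code_points(const unsigned char* data,
                                     std::size_t size) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < size; ++i)
        if ((data[i] & 0xC0) != 0x80) ++n;
    return n;
}

struct bench_result {
    std::size_t checksum;  // Sum of all return values, so the work is used.
    double ns_per_byte;    // Wall-clock nanoseconds per input byte.
};

// Times `iterations` calls of decode(ptr, size) over the same buffer.
// A serious benchmark would also need to stop the compiler from
// hoisting the repeated call out of the loop; this sketch does not.
template <typename Decode>
bench_result run_benchmark(Decode decode, const std::string& input,
                           int iterations) {
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(input.data());
    std::size_t checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        checksum += decode(p, input.size());
    const auto stop = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    const double bytes = static_cast<double>(input.size()) * iterations;
    return bench_result{checksum, bytes > 0 ? ns / bytes : 0.0};
}
```

Reporting time per input byte, together with a checksum of the results,
would make it straightforward to compare two implementations on the
same data.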

Regards, Phil.