Subject: Re: [boost] [string] proposal
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-25 10:06:42


On Tue, 25 Jan 2011 12:27:02 +0800
Dean Michael Berris <mikhailberis_at_[hidden]> wrote:

> On Tue, Jan 25, 2011 at 5:34 AM, Chad Nelson
> <chad.thecomfychair_at_[hidden]> wrote:
>>> Sure, however I personally don't see the value of making the
>>> encoding an intrinsic property of a string object. [...]
>>
>> Then I think we have different purposes, and I'll absent myself from
>> this part of the discussion after this reply.
>
> I don't think we have different purposes, I just think we're
> discussing two different levels.

I suspect that's the case.

> I for one want a string that is efficient and lightweight to use.
> Whether it encodes the data underneath as UTF-32 for convenience is
> largely of little consequence to me at that level.

I think I see where you're coming from: that if you had a set of
conversion iterators that would provide a view of the type in whatever
coding you want, the underlying coding matters very little to you. And
that's effectively what I'm aiming for with the current UTF string
revisions: they all provide a code-point iterator (which works the same
regardless of the underlying type; it always provides the 21-bit
code-point as a 32-bit value), as well as a way to access the encoded
data.
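
To make the "21-bit code-point as a 32-bit value" idea concrete, here is a
minimal sketch of the decoding step a code-point iterator over UTF-8 storage
would perform on each increment. The name next_code_point is hypothetical and
is not part of the utf*_t types; the sketch assumes well-formed UTF-8 and does
no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not the utf8_t interface from this thread.
// Decodes the code point that starts at s[i] and advances i past it.
// Assumes well-formed UTF-8; malformed input is not detected.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                        // 1 byte:  0xxxxxxx

    // Number of continuation bytes implied by the lead byte.
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;

    // Keep only the payload bits of the lead byte (5, 4, or 3 bits).
    char32_t cp = b0 & (0x3F >> extra);

    // Each continuation byte (10xxxxxx) contributes 6 more bits.
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}
```

An iterator built on this would return the same char32_t values whether the
storage underneath is UTF-8, UTF-16, or UTF-32; only the decoding step
differs.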

> However, as I have already described elsewhere on a different
> message, "viewing" a string in a given encoding is much more scalable
> as far as design is concerned as it allows others to extend the view
> mechanism to be unique to the encoding being supported.

And if I understand it correctly (which I don't guarantee), that sounds
like a nice design for some kinds of programs. It's just not the one I'm
pursuing, because the problem it addresses doesn't seem to be the one
I'm trying to solve, which is efficiently storing, manipulating, and
converting Unicode data.

> This allows you to write algorithms and views that adapt existing
> strings (std::string, QString, CString, std::wstring, <insert string
> implementation here>) and operate on them in a generic manner. The
> hypothetical `boost::string` can have implicit conversion constructors
> (?) that deal with the supported strings, and that means you are able
> to view that `boost::string` instead in the view.

I'm avoiding the boost::string idea. It's great in theory, but it's
still pretty nebulous, and I can foresee someone spending a lot of time
trying different things out on it. "There comes a time in every project
when you have to shoot the engineers and put the damn thing into
production." :-) I want working code, or a design I can quickly turn
into working code, and the boost::string idea seems to be tangential to
the problem I'm working on.

>> Before I go, I'll note in passing that I've started on the
>> modifications to the UTF types, and I found that it made sense to
>> omit many of the mutating functions from utf8_t and utf16_t, at
>> least the ones that operate on anything other than the end of the
>> string.
>
> Actually, I think if you have the immutable string as something you
> use internally in your UTF-* "views", then you may even be able to
> omit even the mutation parts even those dealing with the end of the
> string. ;)

The underlying type of the UTF strings can't be an immutable string,
because the conversion functions have to operate one code-point at a
time. For instance, there's no way to know the code-point length of a
UTF-8 sequence without at least walking over it and examining most of
the bytes, which means that converting a raw UTF-8 string to UTF-32
would either have to use a mutable type in between (making it even less
efficient than using a mutable underlying string because it requires
two copy operations) or you'd have to walk it first to get the length,
create the immutable string's storage, then walk it again to convert
each character.
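
The two-walk alternative can be sketched as follows. This is only an
illustration under stated assumptions: count_code_points and
to_utf32_two_pass are hypothetical names, std::u32string stands in for
whatever storage an immutable string would allocate, and the input is assumed
to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the code-point length is the number of bytes that are
// not continuation bytes (10xxxxxx). Every byte still has to be examined.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}

// Hypothetical two-pass conversion: walk once to size the storage, then walk
// again to decode. Assumes well-formed UTF-8; no validation is done.
inline std::u32string to_utf32_two_pass(const std::string& utf8) {
    std::u32string out(count_code_points(utf8), U'\0');    // pass 1: length
    std::size_t i = 0;
    for (char32_t& cp : out) {                              // pass 2: decode
        unsigned char b0 = static_cast<unsigned char>(utf8[i++]);
        int extra = (b0 < 0x80) ? 0 : (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
        cp = (extra == 0) ? b0 : (b0 & (0x3F >> extra));
        while (extra-- > 0 && i < utf8.size())
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i++]) & 0x3F);
    }
    return out;
}
```

The first pass exists only so the destination can be allocated at its final
size; with a mutable destination that can grow, a single decoding pass would
do.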

>> For what it's worth, that's the basic concept that I've adopted for
>> the utf*_t modifications. The utf*_t gives only a code-point
>> iterator (you can also get a char/char16_t/char32_t iterator from
>> the type returned by the encoded() function). I plan to write a
>> separate character iterator that will accept code-points and return
>> actual Unicode characters.
>
> I do suggest however that you implement/design algorithms first and
> build your iterators around the algorithms.
>
> I know that might sound counter-intuitive but having concrete
> algorithms in mind will allow you to delineate the proper (or more
> effective) abstractions better than thinking of the iterators in
> isolation.

The problem is that I don't know what tasks someone might want
out of it. I'm aiming to provide the basics that any other algorithm
can be layered onto, and as many of std::string's capabilities as I can
manage.

-- 
Chad Nelson
Oak Circle Software, Inc.


