Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-23 21:34:17


On Sat, Jan 22, 2011 at 10:43 AM, Chad Nelson
<chad.thecomfychair_at_[hidden]> wrote:
> On Sat, 22 Jan 2011 01:56:36 +0800
> Dean Michael Berris <mikhailberis_at_[hidden]> wrote:
>
>>>> I think strings are different from the encoding they're interpreted
>>>> as. Let's fix the problem of a string data structure first then tack
>>>> on encoding/decoding as something that depends on the string
>>>> abstraction first.
>>>
>>> That gets back to the problem that I was originally trying to solve
>>> with the UTF types: that a string needs a way to carry around its
>>> encoding. A UTF-8 type could be built on such a thing very easily.
>>
>> Hmm... I OTOH don't think the encoding should be part of the string.
>> The encoding is really external to the string, more like a function
>> that is applied to the string.
>
> It's a property of the string. It may change, but some encoding (even
> if it's just "none") should be associated with a particular string
> throughout its existence. Otherwise you might as well use the existing
> std::string.
>

I think I disagree with this. A string is by definition a sequence of
something -- a string of integers, a string of events, a string of
characters. Encoding is not an intrinsic property of a string.

As for using the existing std::string, I think the problem *is*
std::string and the way it's implemented. In particular, I think
allowing mutation of arbitrary individual elements makes users who
don't need that mutation pay for it anyway. Because of this
requirement, things like SSO, copy-on-write optimizations(?), and all
the other algorithm baggage that comes with the std::string
implementation make it a really bad basic string for the language.

In a world where individual element mutation is a requirement,
std::string may very well be an acceptable implementation. In the
other cases, where you never need to mutate a character that's
already in the string, it's a really bad string implementation.

For the purpose of interpreting a string as something else, you don't
need mutation -- and hence you gain a lot by having a string that is
immutable but interpretable in many different ways.

Consider the case where, for example, I want to interpret the same
string as UTF-8 and then later on as UTF-32. In your proposal I would
need to copy the type that has a UTF-8 encoding into another type that
has a UTF-32 encoding. If that copy were trivial, cheap enough that no
programmer would have to think twice about making it, that would be a
good thing -- which is why an immutable string is something that your
implementation would benefit from, from a "plumbing" perspective.
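
To make the "plumbing" point concrete, here is a rough sketch only, not
taken from either proposal. The names immutable_string and decode_utf8
are made up for illustration, and the decoder assumes well-formed
input. Because the buffer is never modified after construction, a copy
is just another handle to the same bytes, and an interpretation is a
function applied from the outside:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical immutable byte string. The buffer is never modified after
// construction, so copies can safely share it.
class immutable_string {
public:
    explicit immutable_string(std::string bytes)
        : data_(std::make_shared<const std::string>(std::move(bytes))) {}

    const char* begin() const { return data_->data(); }
    const char* end() const { return data_->data() + data_->size(); }
    std::size_t size() const { return data_->size(); }

    // True when both handles refer to the same underlying buffer.
    bool shares_buffer_with(const immutable_string& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<const std::string> data_;
};

// Hypothetical interpretation: read the bytes as UTF-8 and return the
// code points. Sketch only: well-formed input is assumed, and overlong
// or invalid sequences are not rejected.
std::vector<char32_t> decode_utf8(const immutable_string& s) {
    std::vector<char32_t> out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.begin());
    const unsigned char* e = reinterpret_cast<const unsigned char*>(s.end());
    while (p != e) {
        unsigned char lead = *p++;
        char32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        for (; extra > 0; --extra) {
            assert(p != e);  // truncated sequences trip this in debug builds
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Copying an immutable_string only bumps a reference count, so handing
the same bytes to a UTF-8 interpretation and later to some other one
never duplicates the buffer.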

>> If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder
>> then that should be the way to go. However building it into the string
>> is not something that will scale in case there are other encodings
>> that would be supported -- think about not just Unicode, but things
>> like Base64, Zip, <insert encoding here>.
>
> I assume that there is some unique identification for each language and
> encoding, or that one could be created. But that's too big a task for
> one volunteer developer, so my UTF classes are intended only to handle
> the three types that can encode any Unicode code-point.
>

Sure, but that doesn't mean that you can't design it in a way that
others can extend appropriately. This was/is the beauty of how the
iterator/range abstraction works out for generic code.
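
As a sketch of what such an extension point might look like (all names
here are hypothetical, and this is not a proposed interface): an
"encoding" is any type with a static decode(first, last, out) member,
and one generic function applies it to any range. Hex stands in for
"some other encoding" because it is short to write; error handling is
deliberately minimal.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical encoding: passes the bytes through unchanged.
struct identity_encoding {
    template <class In, class Out>
    static Out decode(In first, In last, Out out) {
        for (; first != last; ++first)
            *out++ = static_cast<unsigned char>(*first);
        return out;
    }
};

// Hypothetical encoding added by a third party: two hex digits per byte.
struct hex_encoding {
    static int nibble(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        assert(c >= 'A' && c <= 'F');  // sketch only: invalid digits trip this
        return c - 'A' + 10;
    }

    template <class In, class Out>
    static Out decode(In first, In last, Out out) {
        while (first != last) {
            int hi = nibble(*first++);
            if (first == last) break;  // sketch only: a dangling digit is dropped
            int lo = nibble(*first++);
            *out++ = static_cast<unsigned char>(hi * 16 + lo);
        }
        return out;
    }
};

// Generic entry point: works with any range and any encoding that follows
// the convention above, without the string type knowing about either.
template <class Encoding, class Range>
std::vector<unsigned char> decode_as(const Range& r) {
    std::vector<unsigned char> out;
    Encoding::decode(std::begin(r), std::end(r), std::back_inserter(out));
    return out;
}
```

With this shape, decode_as<hex_encoding>(std::string("4869")) yields
the two bytes 'H' and 'i', and supporting another encoding means
writing one more struct rather than touching the string.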

>> Ultimately the underlying string should be efficient and could be
>> operated upon in a predictable manner. It should be lightweight so
>> that it can be referred to in many different situations and there
>> should be an infinite number of possibilities for what you can use a
>> string for.
>
> You've just described std::string. Or alternately, std::vector<char>.

Except these are mutable containers which are exactly what I *don't* want.

-- 
Dean Michael Berris
about.me/deanberris
