Subject: Re: [boost] [string] proposal
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-24 02:04:19
On 01/23/2011 06:34 PM, Dean Michael Berris wrote:
> ... elision by patrick ...
> I think I disagree with this. A string is by definition a sequence of
> something -- a string of integers, a string of events, a string of
> characters. Encoding is not an intrinsic property of a string.
I'm with you here, but to be fair to Chad, you could add to that list a
string of utf-8 encoded characters. If a string contains things with a
particular encoding there's value in being able to keep track of
whether it's validly encoded. It may very well be that a std::string is
part of another type, or that there's some encoding wrapper that lets
you see it as utf-8 in the same way an external iterator lets you look
at chars.
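As a rough illustration of "keeping track of whether it's validly encoded", here is a minimal sketch of a wrapper around std::string that checks UTF-8 well-formedness once, at construction. The names is_valid_utf8 and checked_utf8 are hypothetical and not part of any proposal in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: true if `s` is well-formed UTF-8. Rejects overlong
// forms, surrogates (U+D800..U+DFFF) and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0, min = 0;
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (n - i < len) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Hypothetical wrapper: a std::string plus a remembered validity flag.
class checked_utf8 {
public:
    explicit checked_utf8(std::string bytes)
        : bytes_(std::move(bytes)), valid_(is_valid_utf8(bytes_)) {}

    bool valid() const { return valid_; }
    const std::string& bytes() const { return bytes_; }

    // Counts code points by counting every byte that is not a continuation.
    std::size_t code_point_count() const {
        assert(valid_);  // only meaningful for well-formed input
        std::size_t count = 0;
        for (char c : bytes_)
            if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++count;
        return count;
    }

private:
    std::string bytes_;  // declared first so it is initialized before valid_
    bool valid_;
};
```

Because the wrapper only hands out a const reference to the bytes, the flag cannot go stale; a mutable variant would have to re-validate after each edit.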
> As for using the existing std::string, I think the problem *is*
> std::string and the way it's implemented. In particular I think
> allowing for mutation of individual arbitrary elements makes users
> that don't need this mutation pay for the cost of having it. Because
> of this requirement things like SSO, copy-on-write optimizations(?),
> and all the other algorithm baggage that comes with the std::string
> implementation makes it really a bad basic string for the language.
So you're saying that there _also_ needs to be an immutable string type
that wouldn't pay this penalty.
> In a world where individual element mutation is a requirement,
> std::string may very well be an acceptable implementation. In other
> cases where you really don't need to be mutating any character in the
> string that's already there, well it's a really bad string
> implementation.
So what's wrong with having two different strings?
> For the purpose of interpreting a string as something else, you don't
> need mutation -- and hence you gain a lot by having a string that is
> immutable but interpretable in many different ways.
>
> Consider the case where for example I want to interpret the same
> string as UTF-8 and then later on as UTF-32.
Are you saying that you try it as utf-8, it doesn't decode and then you
try utf-32 to see if it works? Because the same string couldn't be
both. Or are you saying that the string has some underlying encoding
but something lets it be viewed in other encodings, for example it might
actually be EUC, but external iterators let you view it as utf-8 or
utf-16 or utf-32 interpreting on the fly?
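The second reading (one underlying encoding, viewed as another on the fly) could look roughly like the sketch below. It covers only the simplest case, UTF-8 bytes viewed as UTF-32 code points, and assumes the bytes are already known to be well-formed. The class name utf8_as_utf32_view is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical read-only view: walks a UTF-8 byte string and yields UTF-32
// code points on the fly, without copying or converting the stored bytes.
// Assumes the bytes are well-formed UTF-8; it does not validate them.
class utf8_as_utf32_view {
public:
    class iterator {
    public:
        iterator(const std::string* s, std::size_t pos) : s_(s), pos_(pos) {}

        // Decode the code point that starts at the current byte position.
        char32_t operator*() const {
            const std::size_t len = length();
            const unsigned char b0 = byte(pos_);
            // Keep only the payload bits of the lead byte.
            char32_t cp = (len == 1) ? b0 : (b0 & (0xFF >> (len + 1)));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (byte(pos_ + k) & 0x3F);
            return cp;
        }

        // Advance by one whole encoded sequence, not by one byte.
        iterator& operator++() { pos_ += length(); return *this; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }

    private:
        unsigned char byte(std::size_t i) const {
            return static_cast<unsigned char>((*s_)[i]);
        }
        // Sequence length implied by the lead byte.
        std::size_t length() const {
            const unsigned char b0 = byte(pos_);
            const std::size_t len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
            assert(pos_ + len <= s_->size());  // truncated input breaks the assumption
            return len;
        }
        const std::string* s_;
        std::size_t pos_;
    };

    explicit utf8_as_utf32_view(const std::string& s) : s_(&s) {}
    iterator begin() const { return iterator(s_, 0); }
    iterator end() const { return iterator(s_, s_->size()); }

private:
    const std::string* s_;  // non-owning; the string must outlive the view
};
```

With this, `for (char32_t cp : utf8_as_utf32_view(bytes))` visits code points while `bytes` stays untouched. A view over a stateful encoding would need the iterator to carry shift state as well.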
> In your proposal I would
> need to copy the type that has a UTF-8 encoding into another type that
> has a UTF-32 encoding. If somehow the copy was trivial and doesn't
> need to give any programmer pause to do that, then that would be a
> good thing -- which is why an immutable string is something that your
> implementation would benefit from in a "plumbing" perspective.
You could imagine:
    utf8_string u8s;
    utf32_string u32s;
    // some code that gives a value to u32s
    u8s = u32s; // this would use a converting constructor (or a converting assignment)
That would be cool. But what if someone had one of these that
represented an edit buffer and was doing a global search and replace? I
suppose then the underlying string would not be able to be the immutable
one. Perhaps the std::string or std::immutable_string would be a
template argument to basic_utf_string<encoding,stringtype>.
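A minimal sketch of what basic_utf_string<encoding,stringtype> might look like, with the storage type as a template argument and a converting constructor that transcodes through code points. Everything here (the policy structs, the member names, the two aliases) is an assumption made for illustration, and the decoders assume well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding policies: each converts between its storage type
// and a sequence of code points. No validation is performed.
struct utf32_encoding {
    template <class Storage>
    static std::u32string decode(const Storage& s) {
        return std::u32string(s.begin(), s.end());
    }
    template <class Storage>
    static Storage encode(const std::u32string& cps) {
        return Storage(cps.begin(), cps.end());
    }
};

struct utf8_encoding {
    template <class Storage>
    static std::u32string decode(const Storage& s) {
        std::u32string out;
        for (std::size_t i = 0; i < s.size();) {
            const unsigned char b0 = static_cast<unsigned char>(s[i]);
            const std::size_t len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
            assert(i + len <= s.size());  // input is assumed well-formed
            char32_t cp = (len == 1) ? b0 : (b0 & (0xFF >> (len + 1)));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }
    template <class Storage>
    static Storage encode(const std::u32string& cps) {
        Storage out;
        for (char32_t cp : cps) {
            if (cp < 0x80) {
                out.push_back(static_cast<char>(cp));
            } else if (cp < 0x800) {
                out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
                out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            } else if (cp < 0x10000) {
                out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
                out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            } else {
                out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
                out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            }
        }
        return out;
    }
};

template <class Encoding, class Storage>
class basic_utf_string {
public:
    basic_utf_string() = default;
    explicit basic_utf_string(Storage s) : data_(std::move(s)) {}

    // Converting constructor: decode the source to code points, then
    // re-encode them into this string's encoding and storage type.
    template <class OtherEncoding, class OtherStorage>
    basic_utf_string(const basic_utf_string<OtherEncoding, OtherStorage>& other)
        : data_(Encoding::template encode<Storage>(
              OtherEncoding::decode(other.storage()))) {}

    const Storage& storage() const { return data_; }

private:
    Storage data_;
};

// Hypothetical aliases matching the example above.
using utf8_string = basic_utf_string<utf8_encoding, std::string>;
using utf32_string = basic_utf_string<utf32_encoding, std::u32string>;
```

With these aliases, `u8s = u32s;` compiles: the converting constructor builds a temporary utf8_string, which is then move-assigned. Swapping std::string for an immutable storage type would only require that type to offer the same few operations the policies use (begin/end, size, indexing, push_back or a range constructor).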
>>> If you can wrap the string in a UTF-8, UTF-16, UTF-32 encoder/decoder
>>> then that should be the way to go. However building it into the string
>>> is not something that will scale in case there are other encodings
>>> that would be supported -- think about not just Unicode, but things
>>> like Base64, Zip,<insert encoding here>.
>> I assume that there is some unique identification for each language and
>> encoding, or that one could be created. But that's too big a task for
>> one volunteer developer, so my UTF classes are intended only to handle
>> the three types that can encode any Unicode code-point.
>>
> Sure, but that doesn't mean that you can't design it in a way that
> others can extend it appropriately. This was/is the beauty of how the
> iterator/range abstraction works out for generic code.
That's a wonderful idea. You could design it to work with stateful
encodings like JIS and EUC and non-stateful encodings like the utf
encodings.
>>> Ultimately the underlying string should be efficient and could be
>>> operated upon in a predictable manner. It should be lightweight so
>>> that it can be referred to in many different situations and there
>>> should be an infinite number of possibilities for what you can use a
>>> string for.
>> You've just described std::string. Or alternately, std::vector<char>.
> Except these are mutable containers which are exactly what I *don't* want.
But of course, as you said before, if you _do_ want mutability then
std::string is acceptable. It seems that we just need a lighter weight
immutable addition to the fold.
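For concreteness, one possible shape for such a lightweight immutable string is sketched below: a shared, never-modified buffer, so copies are cheap and every "modifying" operation returns a new object. The class name immutable_string and its members are hypothetical, and using std::shared_ptr over std::string is just one convenient way to get the effect.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string. The buffer is shared between copies and is
// never written to after construction.
class immutable_string {
public:
    immutable_string() : data_(std::make_shared<std::string>()) {}
    explicit immutable_string(std::string s)
        : data_(std::make_shared<std::string>(std::move(s))) {}

    std::size_t size() const { return data_->size(); }

    // Returns by value, so there is no way to write through the result.
    char operator[](std::size_t i) const {
        assert(i < size());
        return (*data_)[i];
    }

    const std::string& str() const { return *data_; }

    // Concatenation builds a new string; neither operand changes.
    immutable_string operator+(const immutable_string& rhs) const {
        return immutable_string(*data_ + *rhs.data_);
    }

    // True when both objects refer to the same underlying buffer.
    bool shares_buffer_with(const immutable_string& other) const {
        return data_ == other.data_;
    }

private:
    std::shared_ptr<const std::string> data_;  // pointer-to-const: read-only access
};
```

Copying an immutable_string only bumps a reference count, which is the kind of trivial copy the quoted text asks for. The cost moves to operations like concatenation, which always allocate.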
Patrick