Subject: Re: [boost] [string] proposal
From: David Bergman (David.Bergman_at_[hidden])
Date: 2011-01-29 08:43:20


On Jan 29, 2011, at 7:33 AM, Dean Michael Berris wrote:

> On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk_at_[hidden]> wrote:

[snip]

>> You know what...
>>
>> I'd really like your data structure if you were not
>> calling it string but rather bytes chunk or immutable
>> bytes array.
>>
>> What you are suggesting has nothing to do with text,
>> and I don't understand how you fail to see this.
>>
>
> I don't know if you're not a native English speaker or whether you
> just really think strings are just for text.

First of all, in programming languages (at least the 20 or so that I master and in which I have developed software professionally), the notion of 'string' is that of text (and in some languages 'string' is nothing more than an alias for an array/vector of characters.)

But your intention is to use your "string" for other types of elements, i.e., to be what is called a 'vector' in C++, albeit immutable. No?

So, why are you complaining when Artyom actually wants you to call it exactly what you yourself *claim* it is?

It is you who bring confusion by:

1. sometimes arguing that it is nothing but a byte sequence,

2. sometimes arguing that *anything* can be stored in those sequences, and

3. sometimes talking about text and encodings - in the form of views - clearly indicating some very special use case for your byte sequence.

Can you please clarify *which* notion you are after with your "string" proposal, so that we understand the exact use case(s) for it? We (at least Artyom and myself) have this preconceived notion of what a 'string' is in a programming language, no matter how esoteric that preconception might be...

> Strings are a data structure (look it up).

Yes, definitely. I asked you if you meant the computer-scientific "string" when you said something similar before, and you said 'NO'. But that is the definition and meaning you are alluding to now, is it not? If not, can you please provide a reference to the "string" you want Artyom to look up?

And, actually, the (CS) string is a proper approach to the problem of (textual...) string as well: a sequence of symbols (in our world, 'character' of some form.) This is very important: it is a (finite...) sequence of *symbols* (characters...) which in our case(s) would be actual characters used in a (natural or not) language. It is *not* a sequence of bytes happening to represent a sequence of characters.

> Encoding is a way of
> representing or interpreting data in a certain way. I fail to see why
> encoding has anything to do with a data structure.

Encoding has nothing to do with the sequence of characters, except that in order to *represent* a (CS or 'textual') string one needs some type of encoding, and, yes, one that handles the characters in question (both Latin-1 and UTF-8, for example, can handle the symbol 'Ä').

> So if I have data
> in a data structure, I should be able to apply an encoding on that
> data structure and "view" it a given way I want/need.

There are four layers in play here:

1. The sequence of characters/symbols, as in CS string; totally abstract but precise... (one such CS string can be represented as Unicode or Latin-1 at #2, for instance.)

2. The sequence of code points, in a given character set. Yes, one CS string (as in #1) can have multiple distinct manifestations at this level. They could even be identical as integer sequences (Latin-1 and Unicode, for instance, assign the same numbers to the characters they share).

3. The sequence of code values, using an encoding form such as UTF-8 or UTF-16 for a Unicode code point.

4. The byte storage representing the code values; could be a contiguous sequence of bytes or chunks, etc.
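
To make the four layers concrete, here is a minimal C++ sketch that follows one abstract character, 'Ä', down from #2 to #4. The names (code_point, encode_utf8, encode_utf16, to_bytes) are hypothetical helpers made up for this illustration; they are not part of any proposal in this thread or of any Boost library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Layer 2: a code point, i.e. the number a character set assigns to a
// character. In Unicode, the abstract character 'Ä' (layer 1) is U+00C4.
using code_point = std::uint32_t;

// Layer 3, UTF-8 encoding form: one code point becomes 1 to 4 8-bit code values.
std::vector<std::uint8_t> encode_utf8(code_point cp) {
    // Precondition: a Unicode scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Layer 3, UTF-16 encoding form: one code point becomes 1 or 2 16-bit code values.
std::vector<std::uint16_t> encode_utf16(code_point cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        cp -= 0x10000;  // split the remaining 20 bits into a surrogate pair
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}

// Layer 4: byte storage for 16-bit code values, where byte order is a
// separate choice from the encoding form itself.
std::vector<std::uint8_t> to_bytes(const std::vector<std::uint16_t>& units,
                                   bool big_endian) {
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        out.push_back(big_endian ? hi : lo);
        out.push_back(big_endian ? lo : hi);
    }
    return out;
}
```

For U+00C4, encode_utf8 yields the two code values C3 84, while encode_utf16 yields the single code value 00C4, which to_bytes stores as either 00 C4 or C4 00. Same character, same code point, three different byte sequences at #4.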

It is quite clear that you are (in most posts, at least...) targeting #4 with your proposal. Is that not right? If so, two comments:

1. Why can't this byte storage type be used for all kinds of things? Is 'string' not quite a bad name for it, since it is a string neither according to most programming languages (see above) nor according to the CS definition that you are alluding to? (Unless you consider uninterpreted bytes to be symbols, but be quite aware that those 'symbols' would have nothing - or very little - to do with the symbols of the text represented through your construct.)

2. What is that 'view' notion of yours? It seems to involve a mixture of #2 and #3 above. In what way is it less unstable than reinterpret_cast<>? I.e., does it make sense to be able to switch views?

>
> What I'm saying is, a string data structure should have clearly
> defined semantics -- hence the document going into the immutability,
> value semantics, etc. -- now encoding is largely a matter at a
> different level operating on strings. Encoding is an interpretation of
> strings.

No, encoding is a *representation* of a string (both in the 'text' sense and CS sense.) This difference is crucial. On the other hand: encoding is an interpretation of a byte sequence, *yielding* a string.
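
A small C++ sketch of that last point, under stated assumptions: the same two bytes yield two different strings depending on which encoding is used to interpret them. The names (bytes, decode_latin1, decode_utf8) are hypothetical, and the UTF-8 decoder is deliberately minimal (it rejects bad lead bytes, truncation, and bad continuation bytes, but does not check for overlong forms or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Uninterpreted storage: just bytes, no text semantics attached.
using bytes = std::vector<std::uint8_t>;

// Interpret the bytes as Latin-1: every byte is exactly one code point.
std::u32string decode_latin1(const bytes& in) {
    std::u32string out;
    for (std::uint8_t b : in) {
        out.push_back(static_cast<char32_t>(b));
    }
    assert(out.size() == in.size());  // one byte, one character
    return out;
}

// Interpret the bytes as UTF-8: 1 to 4 bytes form one code point.
std::u32string decode_utf8(const bytes& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::uint8_t lead = in[i];
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;
            cp = static_cast<char32_t>(lead & 0x1F);
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
            cp = static_cast<char32_t>(lead & 0x0F);
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
            cp = static_cast<char32_t>(lead & 0x07);
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const std::uint8_t c = in[i + k];
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = static_cast<char32_t>((cp << 6) | (c & 0x3F));
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Given the bytes C3 84, decode_utf8 yields the one-character string 'Ä' (U+00C4), whereas decode_latin1 yields a two-character string (U+00C3, U+0084). The byte sequence alone does not determine the string; the encoding does.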

> *I* fail to see why *you* fail to understand this clear statement.

Because it is false? Again: a 'string' is *not* a sequence of uninterpreted (i.e., detached from encoding) bytes, neither in most programming languages nor in CS. If you have any other definition for 'string' you can provide that, but rest assured that most people will have their preconceived notions firmly established in one (or both) of the above fields.

/David

