Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-29 10:14:25
On Sat, Jan 29, 2011 at 9:43 PM, David Bergman
> On Jan 29, 2011, at 7:33 AM, Dean Michael Berris wrote:
>> On Sat, Jan 29, 2011 at 8:06 PM, Artyom <artyomtnk_at_[hidden]> wrote:
>>> You know what...
>>> I'd really like your data structure if you were not
>>> calling it string but rather bytes chunk or immutable
>>> bytes array.
>>> What you are suggesting has nothing to do with text,
>>> and I don't understand how you fail to see this.
>> I don't know if you're not a native English speaker or whether you
>> just really think strings are just for text.
> First of all, in programming languages (at least the 20 or so that I master and in which I have developed software professionally), the notion of 'string' is that of text (and in some languages 'string' is nothing more than an alias for an array/vector of characters.)
> But your intention is to use your "string" for other types of elements, i.e., to be what is called a 'vector' in C++, albeit immutable. No?
I wonder where you got that notion. I framed the discussion around my
definition of `string` to be a sequence. In that context (in an
earlier post) I was basically saying "a string is a data structure for
holding things, [FOR EXAMPLE] a string of events, a string of
characters, ..." just to frame the definition properly and identify
that I was talking about a data structure.
I had no other intention of implementing a string of events, but
mostly that data structure is already there.
Lifting the notion of what a string is, is what I did.
And I explained it as well (I hate going meta on English like this,
but hey...) that it was a linguistic tool used to set the stage for a
discussion. It's more like "setting the basis on which your
arguments will rest". I never thought this would be such an
issue in presenting the idea of what a string *data structure* is.
> So, why are you complaining when Artyom actually wants you to call it exactly what you yourself *claim* it is.
> It is you who bring confusion by:
> 1. sometimes arguing that it is nothing but a byte sequence,
> 2. sometimes arguing that *anything* can be stored in those sequences, and
> 3. sometimes talking about text and encodings - in the form of views - clearly indicating some very special use case for your byte sequence
> Can you please clarify *which* notion you are after with your "string" proposal? So we understand the exact use case(s) for it? Since we (at least Artyom and myself) have this preconceived notion of what a 'string' is in a programming language, no matter how esoteric that preconception might be...
A string is something that contains data, is immutable, and on which
you can perform the primitive operations that define the whole "string
calculus" I refer to in the paper. It's like asking me to define what
a number is in math when it's just really a value that plays along a
set of concepts and within a certain set of rules.
I can describe to you the concept of a string -- and that is general
by design, so that we can talk about interfaces and all that
design-goodness jazz. What you can't make me do is say "a string is a
series of characters with an encoding" because that's not what
describes the *concept* of a string. Now a string is a data structure
in my mind and I'm trying my best to explain how that data structure
doesn't include an encoding.
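
To make the description above concrete, here is a minimal sketch of a
string that contains data, is immutable, and knows nothing about
encodings. Everything in it is an assumption made for illustration: the
name `istring`, the chunked storage, and the choice of operations are
hypothetical and are not taken from the paper or from any proposed
implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical illustration only: an immutable byte string stored as a
// list of shared, read-only chunks. No encoding is stored anywhere.
class istring {
public:
    using chunk = std::vector<unsigned char>;

    istring() = default;

    explicit istring(chunk bytes) {
        if (!bytes.empty())
            chunks_.push_back(std::make_shared<chunk>(std::move(bytes)));
    }

    // Concatenation builds a new value that shares the chunks of both
    // operands; neither operand is modified and no bytes are copied.
    friend istring operator+(const istring& a, const istring& b) {
        istring r;
        r.chunks_ = a.chunks_;
        r.chunks_.insert(r.chunks_.end(), b.chunks_.begin(), b.chunks_.end());
        return r;
    }

    std::size_t size() const {
        std::size_t n = 0;
        for (const auto& c : chunks_) n += c->size();
        return n;
    }

    // Read-only element access. It walks the chunks because the storage
    // is not contiguous.
    unsigned char at(std::size_t i) const {
        for (const auto& c : chunks_) {
            if (i < c->size()) return (*c)[i];
            i -= c->size();
        }
        assert(false && "index out of range");
        return 0;
    }

private:
    std::vector<std::shared_ptr<const chunk>> chunks_;
};
```

There is no mutating member at all, so `a + b` is the only way to get a
longer string, and the result may span several chunks.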
Now the *view* mechanism is what allows for *interpreting the data in
a string (whatever that is)* and looking at it a certain way. An
encoding is something you apply to data to make it look like a certain
thing -- in your and Artyom's case that's just a UTF-8 encoded view of
the underlying data in a string.
>> Strings are a data structure (look it up).
> Yes, definitely. I asked you if you meant computer-scientific "string" when you said something similar before and you said 'NO'. But, that is the definition and meaning you are alluding to now, is it not. If not, can you please provide a reference to the "string" you want Artyom to lookup.
See, the answer I gave about whether it's a CS-string was clear: yes,
it is a string data structure. And I thought I was clear.
> And, actually, the (CS) string is a proper approach to the problem of (textual...) string as well: a sequence of symbols (in our world, 'character' of some form.) This is very important: it is a (finite...) sequence of *symbols* (characters...) which in our case(s) would be actual characters used in a (natural or not) language. It is *not* a sequence of bytes happening to represent a sequence of characters.
So on a computer, let's be frank, what is it that you deal with --
isn't it bytes? So we can run around in circles about whether it's a
character string, an event string, or a wide character string, in the
end, they are *bytes*. Now of course in some encodings a character may
be of a minimum and (potentially, but not required) maximum number of
bytes "per element" or "symbol" if you like. In other encodings it's a
fixed size (as in ASCII). How the string stores it is largely
inconsequential as long as you view it in a consistent manner.
>> Encoding is a way of
>> representing or interpreting data in a certain way. I fail to see why
>> encoding has anything to do with a data structure.
> Encoding has nothing to do with the sequence of characters, except that in order to *represent* a (CS or 'textual') string one needs some type of encoding, and, yes, one that handles the characters in question (such as both Latin-1 and UTF-8 being able to handle the symbol 'Ã')
An encoding is largely a function that you apply to input to yield output.
Given a string 's' and a function 'encode', an encoded string is the
result of `encode(s)`.
What's so hard to understand there?
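
As a minimal sketch of `encode(s)`: a function that takes a sequence in
and returns bytes. The name `encode_utf8`, the choice of UTF-8, and the
use of `std::vector` for both input and output are assumptions made for
illustration; the thread does not specify any of them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for `encode`: maps a sequence of Unicode code
// points to UTF-8 bytes. Surrogate code points are not rejected here.
std::vector<std::uint8_t> encode_utf8(const std::vector<char32_t>& s) {
    auto byte = [](char32_t v) { return static_cast<std::uint8_t>(v); };
    std::vector<std::uint8_t> out;
    for (char32_t cp : s) {
        assert(cp <= 0x10FFFF);  // largest valid Unicode code point
        if (cp < 0x80) {
            out.push_back(byte(cp));
        } else if (cp < 0x800) {
            out.push_back(byte(0xC0 | (cp >> 6)));
            out.push_back(byte(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(byte(0xE0 | (cp >> 12)));
            out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(byte(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(byte(0xF0 | (cp >> 18)));
            out.push_back(byte(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(byte(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(byte(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For example, the single code point U+00C3 encodes to the two bytes
0xC3 0x83, while U+0041 stays the single byte 0x41.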
>> So if I have data
>> in a data structure, I should be able to apply an encoding on that
>> data structure and "view" it a given way I want/need.
> There are four layers in play here:
> 1. The sequence of characters/symbols, as in CS string; totally abstract but precise... (one such CS string can be represented as Unicode or Latin-1 at #2, for instance.)
Yes, but it doesn't *have* to be *defaulted* to something which is my
whole point all along.
> 2. The sequence of code points, in a given character set. Yes, one CS string (as in #1) can have multiple distinct manifestations at this level. They could be identical in integral sequences.
> 3. The sequence of code values, using an encoding form such as UTF-8 or UTF-16 for a Unicode code point.
> 4. The byte storage representing the code values; could be a contiguous sequence of bytes or chunks, etc.
> It is quite clear that you are (in most posts, at least...) targeting #4 with your proposal. Is that not right? If so, two comments:
No, you missed it too. I was presenting the foundation (#4) so that I
can build upon it 1, 2, and 3. The approach is bottom up instead of
> 1. Why can this byte storage type not be used for all kinds of things; is not 'string' a quite bad name for it, since it is neither a string according to most programming languages (see above) nor according to that CS definition that you are alluding to (unless you consider uninterpreted bytes to be symbols, but be quite aware that those 'symbols' would have nothing - or very little - to do with the symbols of the text represented through your construct.)
It actually can. But that only makes sense if the operations make
sense on it. For example if you put the raw byte sequences for a float
into a string -- does concatenation make sense for that data? Maybe in
your application yes, but what's *in* the string is largely
inconsequential to the algorithms you apply to the string. Now if you
had a view which wrapped these concatenated byte sequences of floats
and yielded a pair of floats, isn't the abstraction still
appropriate? I'll leave you to work that out on your own.
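
One possible way to work the float example through, as a sketch only:
the helper names `float_bytes` and `as_float_pair` are hypothetical, a
`std::vector` stands in for the string, and the bytes are in the
machine's native byte order.

```cpp
#include <cassert>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical helper: the raw bytes of one float, native byte order.
std::vector<unsigned char> float_bytes(float f) {
    std::vector<unsigned char> b(sizeof f);
    std::memcpy(b.data(), &f, sizeof f);
    return b;
}

// Hypothetical "view": interprets exactly two floats' worth of bytes as
// a pair of floats. memcpy is used so the bytes never have to be
// aligned or aliased as float objects.
std::pair<float, float> as_float_pair(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() == 2 * sizeof(float));
    float first = 0.0f;
    float second = 0.0f;
    std::memcpy(&first, bytes.data(), sizeof first);
    std::memcpy(&second, bytes.data() + sizeof first, sizeof second);
    return {first, second};
}
```

Concatenating `float_bytes(1.5f)` and `float_bytes(-2.0f)` and passing
the result to `as_float_pair` gives back the two values; the
concatenation itself never needed to know the bytes were floats.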
> 2. What is that 'view' notion of yours - it seems to involve a mixture of #2 and #3 above? In what way is it less unstable than reinterpret_cast<>? I.e., does it make sense to be able to switch views?
Because reinterpret_cast assumes that the data referred to in the
pointer is contiguous. And I have already maintained that the string's
implementation will explicitly be non-contiguous so that you cannot
assume that the data it contains is contiguous. Now does that make
>> What I'm saying is, a string data structure should have clearly
>> defined semantics -- hence the document going into the immutability,
>> value semantics, etc. -- now encoding is largely a matter at a
>> different level operating on strings. Encoding is an interpretation of
> No, encoding is a *representation* of a string (both in the 'text' sense and CS sense.) This difference is crucial. On the other hand: encoding is an interpretation of a byte sequence, *yielding* a string.
Encoding is an algorithm (a transformation if you will) applied to
*data* and what's yielded is an encoded result. So when you take a
string (as how I define it) and you wrap it in a "raw view" then you
get the raw data in the string as exposed by iterators. Then think
about what happens when you view a string in a given encoding; let's
say you have BOM markers in the beginning of a byte sequence, or
somehow have data at the start of it that gives information about what
the encoding is -- then you can make a generic 'view' that can handle
this data appropriately. The possibilities are endless here.
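
A generic view of that kind needs a first step that looks at the start
of the data. A minimal sketch of that step, assuming only the standard
UTF-8 and UTF-16 byte order marks are of interest; the `encoding` enum
and the function `detect_bom` are hypothetical names, not part of the
proposal.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tag type for the encodings this sketch can recognise.
enum class encoding { utf8, utf16_le, utf16_be, unknown };

// Inspects the leading bytes for a byte order mark. UTF-32 is ignored
// here; its little-endian BOM starts with the same two bytes as the
// UTF-16 little-endian one.
encoding detect_bom(const std::vector<unsigned char>& d) {
    if (d.size() >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF)
        return encoding::utf8;
    if (d.size() >= 2 && d[0] == 0xFF && d[1] == 0xFE)
        return encoding::utf16_le;
    if (d.size() >= 2 && d[0] == 0xFE && d[1] == 0xFF)
        return encoding::utf16_be;
    return encoding::unknown;
}
```

A view built on top of this could then pick the matching decoder, or
fall back to a raw view when the result is `encoding::unknown`.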
So when you see `view<utf8_encoded>` that tells you whatever string
you wrap with this view will be viewed as a UTF-8 encoded "text" in
your parlance -- actually what's going to happen is you have access to
iterators that yield appropriately-typed code-points or "characters".
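
A rough sketch of what such a view could look like. The class
`utf8_view` below is hypothetical and is not the `view<utf8_encoded>`
template itself; it assumes well-formed UTF-8 and, to stay short, wraps
a contiguous `std::vector` rather than non-contiguous storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical read-only view: iterating it yields code points decoded
// from the underlying bytes. The bytes themselves are never changed.
class utf8_view {
public:
    class iterator {
    public:
        explicit iterator(const unsigned char* p) : p_(p) {}

        // Decodes the code point starting at the current byte.
        char32_t operator*() const {
            const unsigned char b = p_[0];
            assert((b & 0xC0) != 0x80);  // must not start on a continuation byte
            if (b < 0x80) return b;
            if (b < 0xE0)
                return static_cast<char32_t>(((b & 0x1F) << 6) | (p_[1] & 0x3F));
            if (b < 0xF0)
                return static_cast<char32_t>(((b & 0x0F) << 12) |
                                             ((p_[1] & 0x3F) << 6) |
                                             (p_[2] & 0x3F));
            return static_cast<char32_t>(((b & 0x07) << 18) |
                                         ((p_[1] & 0x3F) << 12) |
                                         ((p_[2] & 0x3F) << 6) |
                                         (p_[3] & 0x3F));
        }

        // Advances by the length implied by the lead byte.
        iterator& operator++() {
            const unsigned char b = *p_;
            p_ += b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            return *this;
        }

        bool operator!=(const iterator& o) const { return p_ != o.p_; }

    private:
        const unsigned char* p_;
    };

    // The view only refers to the bytes; they must outlive the view.
    explicit utf8_view(const std::vector<unsigned char>& bytes) : bytes_(bytes) {}

    iterator begin() const { return iterator(bytes_.data()); }
    iterator end() const { return iterator(bytes_.data() + bytes_.size()); }

private:
    const std::vector<unsigned char>& bytes_;
};
```

With this, `for (char32_t cp : utf8_view(bytes))` walks the same bytes
that a raw view would expose one by one, but yields one code point per
step.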
>> *I* fail to see why *you* fail to understand this clear statement.
> Because it is false? Again: a 'string' is *not* a sequence of uninterpreted (i.e., detached from encoding) bytes, neither in most programming languages nor in CS. If you have any other definition for 'string' you can provide that, but rest assured that most people will have their preconceived notions firmly established in one (or both) of the above fields.
A string is a data structure that contains data, has defined
operations on data, and is largely viewed as a container -- in my
definition, it is also immutable. Much like how numbers in math are
defined by a concept, a higher level concept of string that defines
its interface and semantics is what I have presented.
Also, whatever the encoding of the underlying data in a string is, it is
largely inconsequential to the operations defined on it.
Truth is largely a matter of agreement, I find, so long as we
all agree on what the definition of truth is. Sorry to go
philosophical on you, but really, saying "it's false" is different from
saying "because I disagree".
-- Dean Michael Berris about.me/deanberris