Boost logo

Boost :

Subject: Re: [boost] [Potentially OT] String Concatenation Operator
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2010-08-29 03:23:27


On Sat, Aug 28, 2010 at 5:30 AM, Mathias Gaunard
<mathias.gaunard_at_[hidden]> wrote:
> On 27/08/2010 03:50, Dean Michael Berris wrote:
>
>> Actually, not *just* a range of characters.
>>
>> If your range dealt with ownership of data and abstracted the means by
>> which the data is actually manipulated (even through iterators) then
>> maybe a string is just a range of characters. If your range made sure
>> that the data is moved instead of copied in certain situations, or
>> whether there is an optimization that can be made on certain
>> operations (like concatenation) then yeah I'll agree that a string is
>> just a range of characters.
>>
>> If you think about a string as a special beast that does a lot of things:
>>
>> 1. Allocates memory which holds the characters
>> 2. Offers an abstraction unique to strings (token_iterator,
>> line_iterator, char_iterator, wchar_iterator)
>> 3. Has value semantics similar to built-in types (copyable, movable,
>> "regular")
>>
>> Then it doesn't look like just a range.
>
> What I call a string is just the data, not a container of said data.
>
> I don't see what 2 is doing in there. Surely those mechanisms are
> independent of the container.
>

The data is the container when it comes to the (ugly) world of
strings. You can treat a string as a special case of a container that
is also the data. The string "Foo" is really a self-contained
data-type which has the properties of size (amount of memory used) and
contents. For literal strings much of the information (except maybe
the contents) is available at compile time, and using a pure runtime
construct such as a range is inadequate IMO -- thus the question about
a different (if not better) means of concatenating literal strings and
by extension strings in general, abstracted enough by a string type
that knows how to deal with these things both at compile time and
runtime.

Like I said, you can treat a string as a range through the iterators,
but really looking at a string as *just* a range without thinking
about the implications of efficient memory management and value
semantics outside the realm of ranges is lacking IMO.

>
>> Sure, but if the string type already does that for you at the time you
>> need it, then it's inherently supported by the type, no?
>
> If you can iterate through a sequence of copyable elements, then you can
> copy those elements into a new sequence of the structure of your choosing.
> This doesn't require any particular support from the first sequence type.
>

My contention is that you don't need to iterate through a sequence of
copyable elements earlier than you have to. A string type should be
allowed and should be able to delay these operations until the point
of when the information is required. Again unless you can return a
range from a function and ensure that the data that the range refers
to is valid (unlike a container which does guarantee that) then ranges
would be insufficient for that case.

>
>> That's cool, but in cases where you deal with potentially huge strings
>> (i.e. more than a memory page's worth of data) you start looking for
>> ways of moving some of the work out of runtime and into compile time
>
> This doesn't make sense.
> If the data is too big to fit into memory, then it's going to be completely
> out of the range (no pun intended) of what the compile-time world can deal
> with.
>

Actually not so. You can know the length of a string at compile time
or at least the maximum length of the string at compile, then it costs
you only the size of the type that holds the information about the
string. It doesn't mean that you need the whole string in memory
already at compile time.

If for example a string is known to have only a maximum length of 4096
bytes and a minimum of 100 bytes at compile time, then the string type
can then allocate only as much as 100 bytes enough to represent that
sub-string. At runtime then you already have a buffer that is at least
100 bytes long, and a string type that exposes an iterator that uses
an in-memory window of 100 bytes and re-use that buffer that it
already allocated and knew the size of statically. Think about strings
that are mmapped to files or those that are built from socket IO. In
cases where you know the string will always be 4096 bytes at compile
time, then what's the point of not allocating all the 4096 bytes as a
boost::array for example?

>
>> And especially in
>> cases where you need to abstract the string from being something that
>> is exclusively in memory to something that refers to data that is not
>> in memory (retrieved from a socket, from a file, from user input) you
>> run into issues like buffer management, demand-driven/lazy-loading
>> data, etc. that a range is not the end-all and/or best solution for.
>>
>> For instance, think about a forward iterator (or a single-pass
>> iterator?) that the string can expose to allow for one-time traversal
>> of the data it holds or it refers to. The string doesn't have to have
>> the data all in memory if it doesn't yet, and it can then lazily load
>> the data and expose it through that forward iterator.
>
> I don't get your argument.
> Basically you're saying "wouldn't it be cool if you could not actually store
> your data in memory, but generate it as you're traversing or get it from I/O
> on demand?". That's what ranges are.
> Ranges *are* iterators.
>

I'm not saying ranges don't work. I'm saying for the purpose of doing
a better string, thinking just in ranges is lacking.

Nothing is stopping the new string type to be smarter than your
average string and still be dealt with like a range. I'm just saying
that thinking about it as *just* a range is missing the forest for the
trees.

>> If the string handle knew that some part
>> of the string was a conglomeration of statically-sized literals then
>> it can hold that data in a boost::array of the correct size
>
> No it cannot.
> string_handle is a single type defined in advance, and it must have a finite
> size. It cannot guess how many conglomerations of statically-sized literals
> you're going to want to put into it, and therefore can't guarantee enough
> storage.
>

You can actually -- if you treat the operator= as your evaluator, you
can have a temporary "statically-sized" boost::array that holds the
conglomerated data, wrapped in a template type that derives from a
virtual base, which exposes the appropriate iterators as a pimpl. You
can then as part of either the copy constructor, move constructor,
and/or the copy/move assignment operator for string_handle.
Pseudocode:

struct base;

template <size_t size_>
struct data : base {
  // implementation here
};

struct string_handle {
  // ...
  template <class AST>
  string_handle & operator=(AST const & rhs) {
    boost::array<typename char_<AST>::type, sizeof_<AST>::value> buffer;
    eval(rhs, buffer); // perform the copy to the buffer
    this->pimpl(new data<sizeof_<AST>::value>(move(buffer)));
    return *this;
  }
  // ...
  unique_ptr<base> pimpl;
};

You can even think about a list of variants to hold conglomerated
literals, bounded strings, and string_handle instances, and then
expose iterators that know how to traverse the list of variants
maintained by the string_handle.

>
>> then you can correctly
>> allocate enough space at the point where the operator= is implemented.
>
> At runtime; (operator= is a function that executes code, not a type
> definition that provides automatic storage) which makes the memory
> dynamically allocated, as I said.
>

Sure, but operator= can be a template function that can also perform
computations on the types/statically-known-data at compile time. It's
not just about avoiding dynamic memory, it's all about knowing the
sizes at compile-time.

>
>> Of course the type erasure might be an issue, but knowing the eventual
>> size at compile time allows you a lot of optimizations you otherwise
>> can't or won't do.
>
> Here, I can't see it bringing any benefit compared to knowing it at runtime,
> since the allocation can only happen at runtime.
>

If you know that you have 10 adjacent literal strings from the Proto
AST then you can transform that into one long literal buffer and know
the size at compile time. Even if your allocation happens at runtime,
knowing the size upfront saves you 9 unnecessary temporary
allocations, or 9 iterator pairs in case of a joined range. I don't
see why that's *not* a saving.

>
>
>> Yes, but then you have a left-leaning tree of iterator pairs. ;)
>
> Just like you have a tree of a bounded segments, or an AST if you go for a
> proto solution.
>

Yes, but you can definitely do a lot more at compile time with a proto
AST than a left-leaning tree of iterator pairs in the case of just
runtime ranges.

>
>
>> If the new string type implemented iterators (several types of
>> iterators in fact) and manages the memory for you in a configurable
>> manner, *and* allows you to convert it to either an std::string or an
>> std::wstring, why wouldn't it be inter-operable?
>
> Because your code will expect your string type, not the string type of the
> user, meaning he'll have to convert to it.
> Converting is not actually needed, since you could just treat your type, the
> type of the user, or any smart lazy evaluated type the same.
>

But if the client code didn't even see this type -- it can be that
magical you know -- and it plays well with the other STL-string like
types, why won't it be interoperable?

>
>> But the range adaptors don't solve the issue of multiple allocations
>
> How do they not?
>
> Your problem is that you reallocate multiple times with a classic binary
> operator+ implementation: (pseudo-code)
>
> buffer = a + b + c
>
> is
>
> buffer = allocate(size(a))
> copy(buffer, a)
>
> buffer2 = allocate(size(buffer)+size(b))
> copy(buffer2, buffer)
> cat(buffer2, b)
> free(buffer)
>
> buffer = allocate(size(buffer2)+size(c))
> copy(buffer, buffer2)
> cat(buffer, c)
> free(buffer2)
>
> Instead of that you could just evaluate that expression as
>
> r = join(a, b, c)
> buffer3 = allocate(size(r))
> copy(buffer3, r)
>
> As you can see, it solves the problem just fine.
>

Not if a, b, or c are literals. The call to size(r) is not evaluated
at compile-time, and you'll have to store a, b, c, somehow as iterable
strings -- temporary std::string objects -- each one being an
allocation -- instead of just a pointer to statically-located data as
what a compiler will usually do for string literals. The traversal for
size(r) is also going to happen at runtime and is potentially an O(n)
operation.

>
>> Ah, because 'auto' is C++0x and I was still thinking that maybe people
>> not moving to C++0x might as well want to be able to use this new
>> string type even at C++03.
>
> You can write the type out yourself or use a result_of-like protocol, like
> Fusion does.
>

Sure, that's one option, but the simple type erasure that costs
exactly one indirection would be just fine.

>
>
>> Yes! Well, technically not stateful operations -- they're functionally
>> pure operations: concatenation doesn't munge with the existing
>> strings, it just returns a new string made of the contents of the
>> other strings (or string generators).
>
> If you end up re-storing the result into the same variable, it basically is
> stateful.
>

But you don't in the case of concatenation -- strictly speaking, the
assignment operation is the stateful part, not the concatentation. In
the situation I'm exploring, Concatenation is non-immediate (more
correctly, is a lazy operation) and is non-destructive.

Of course, the string will eventually have to expose the iterators
anyway, and all the range operations will still apply. Except the
smarter string will already have bought you the optimizations possible
at compile time and at runtime.

-- 
Dean Michael Berris
deanberris.com

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk