Boost logo

Boost :

Subject: Re: [boost] [Potentially OT] String Concatenation Operator
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-08-26 07:07:27


On 26/08/10 08:09, Dean Michael Berris wrote:

> Indeed.
>
> However, I am looking for a means of representing a string -- a
> collection of characters on which you can implement algorithms on.
> They might as well be ranges, but things like pattern matching and
> templates (as in string templates, where you have placeholders and
> other things) can be applied to or used to generate them.

A string is just a range of characters; nothing more.

>> I suppose you'd have to linearize it.
>> Sending it in multiple chunks would have different behaviour on different
>> types of sockets.
>>
>
> Yes, but something that is inherently supported by the type.

There is no "inherently supported" or "not inherently supported".
Linearizing the memory means you have to copy it in a contiguous buffer.

> Why strings are important has a lot to do with being able to perform
> domain-specific optimization on string algorithms. Things like
> capitalization, whitespace removal, encoding/decoding, transformations
> like breaking up strings according to some pattern (tokenization,
> parsing, etc.). Because you can specialize the memory-management of
> strings (as opposed to just ranges of char's) the "win" in treating a
> string as a separate type are practical rather than conceptual.

I've got a system that does arbitrary conversion and segmentation of
ranges in a lazy way without ever allocating memory.
I can use it for things like on-the-fly character encoding conversion,
Unicode normalization, or base64 encoding, for a few examples.

It never requires to differentiate strings from other types of ranges,
not to use special containers or anything.

>>> What I wanted to be able to do (and am reproducing at the moment) is a
>>> means of doing the following:
>>>
>>> string_handle f = /* some means of building a string */;
>>> string_handle s = str("Literal:") ^ f ^ str("\r\n\r\n");
>>
>>> std::string some_string = string_handle; // convert to string and build
>>> lazily
>>
>> How about:
>>
>> boost::lazy_range<char> f = /* some means of building a string */
>>
>> boost::lazy_range<char> s = boost::adaptors::join(
>> boost::as_literal("Literal"),
>> f,
>> boost::as_literal("\r\n\r\n")
>> );
>>
>> std::string some_string;
>> boost::copy(s, std::back_inserter(some_string));
>>

Just in case you haven't noticed, that's exactly the same code as you
wrote, except it's real code that works and it uses different names.

> That's fine if I can control the memory allocation of the lazy range.

Using type erasure means you have to use dynamic memory allocation
eventually, unless you limit it to use stack storage and refuse types
that are too large. (which in practice should be able to work fine)

> As it is, a lazy range just represents a collection of iterator pairs

Just the one pair, actually.

> -- the data has to live somewhere still.

Having it live in your new "string type" is a bad idea, as this just
makes things not inter-operable.
Simply allow people to use whatever data structure they see fit for
their storage needs, and just deal with ranges when you want to
manipulate strings.

> What I'm looking for is a
> combined ownership+iteration mechanism. Right now the problem of
> allocating a chunk of memory every time a you concatenate two strings
> is the problem I'm trying to solve with metaprogramming and knowing at
> compile time how much memory I'm going to need to allocate.

What meta-programming will give you is a type that represents an
expression that you can then evaluate. It doesn't own the values in any way.

Range adaptors are exactly the same thing, they're a lightweight DSEL.

> Of course if you're dealing with strings that have an unknown length
> (as in my example, we really can't tell the length of 'f' at the point
> 's' is defined) at least getting to know the parts that have a known
> length at compile time (the literals) allows me to allocate enough
> space ahead of time with the compiler's help.

With ranges, the way it works is depending on whether the range you want
to get the size of has random access or not.
If it is, getting the size is O(1). If it isn't, it's O(n).
If it's a single-pass range, you can't even get the size.

a joined_range has the lowest category of the ranges it's joining, so if
one of the ranges isn't random access, then the concatenated range isn't.

> Maybe instead of having
> multiple concatenations, what happens is I allocate a chunk "just
> large enough" to hold the multiple concatenated strings, and just
> traverse the lazy string as in your example. The copy happens at
> runtime, the allocation of a large enough buffer (maybe a
> boost::array) happens at compile-time (or at least the determination
> of the size).

You can't get the size of a range at compile-time.

If you're only dealing with statically-sized stuff, I suggest you use
fusion sequences, which is basically the same thing as ranges except
they're statically-sized and can have heterogeneous elements. (and can
make your compile-time explode)
There is a boost::fusion::joint_view that behaves the same as boost::join.

You can't really mix both runtime and compile-time sized things
together. It's either one or the other, since compile-time code and
runtime code at different beasts. (of course, you can consider a fusion
sequence as a range, but then you lose the information it was statically
sized when manipulating it as a range)

>
>> boost::lazy_range is not actually in boost, but that would be a type erased
>> range, and I've got an implementation somewhere.
>
> Maybe a lazy_range would be nice to have in Boost. Or even just a join iterator.

The join iterator is already there (since 1.43 I think? maybe before
that), I wouldn't have put it in my code without saying it's not in
Boost otherwise.

Type erasure for ranges and iterators isn't in Boost, I suppose because
there is little interest given using the real type is much more
efficient, it would need allocations policies, and there is still hope
that someday we'll have a generic type erasure facility.

>> Of course, not using type erasure at all (i.e. replacing lazy_range by auto
>> or the actual type of the expression) would allow it to be quite faster.
>>
>
> Definitely.

Then why the string_handle thing? You want a type that represents the
expression.

> The hope was to be able to determine at least the length of the whole
> string from the concatenation of multiple strings

Nothing particularly hard to do there.

> so that effective
> memory allocation can be done at the time it's needed, which is
> usually at the time a copy of the whole string is required.

The problem is that you seem to want to apply stateful operations on
your string object and delay them until you copy it, and have it
automagically deal with lifetime and ownership issues.

Maybe the right solution for you is to use a rope data structure, which
is a container and doesn't actually concatenate things, it just puts the
fragments in a self-balancing tree.
Never really understood the point of that data structure myself.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk