Boost logo

Geometry :

Subject: Re: [geometry] 1.57 Storing indices as values to save space.
From: Tim Finer (tim.finer_at_[hidden])
Date: 2014-11-26 18:56:51


Hi Adam,

On 11/26/14, 11:49 AM, Adam Wulkiewicz wrote:
> Hi Tim,
>
> Tim Finer wrote:
>> Hi Adam,
>>
>> Actually, I think what rtree is doing with the temporary vector in
>> the packing constructor should be considered a bug.
>>
>> That problem is that the vector is created without regard to the
>> rtree's allocator - so even if a user gives the rtree a memory mapped
>> file allocator, the rtree creates a copy of the incoming data in
>> memory, ignoring the allocator.
>>
>
> Not really a copy of incoming data. A calculated centroids and
> iterators to the original data.
Ah I see. IndexableGetter must be the source for the centroid though,
right? Can you use the IndexableGetter to calculate and sort on the
centroid without allocating a temporary vector of them?

>
> Yes, rtree's allocator could be passed to the vector (as mentioned in
> the other email). Then iterators shouldn't be stored since they might
> be not suitable for storing in shared memory. So probably a copy of
> actual objects should be made here. And this could be done only if an
> allocator was different than std::allocator<> so at least in this case
> it wouldn't harm the performance. Also in the case of passing
> RandomAccessIterators elements indexes could be stored instead of
> iterators.
Except I think the word "should" belongs there instead of "could". :)
The whole point of passing in an allocator is for naught if the class
selectively uses it - it doesn't matter about shared memory allocators
or not. All of the classes that I'm familiar with that use an allocator
use it completely (like the stdlib containers).

Step back from your knowledge of the implementation and think about this
as a user of the rtree. Would you really expect the rtree to allocate
space from the heap, after giving it an allocator?

The rtree shouldn't second guess the intentions of the users here and
allocate memory from the heap for a temporary structure. There are
plenty of important uses cases to use other kinds of allocators that
don't have anything to do with shared or mapped memory: debugging
allocators, small object allocators, memory pools, etc. What makes this
a bug is that it is a surprise. It probably would have gone unnoticed
by me but for the operating system killing my testbed. :)

The need for this heap allocated vector appears to be packing related,
would it make sense to expose the packing algorithm as a free function
like the std::transform in <algorithm> instead? It could even take a
callable object, just like transform does (it'd be nice to reuse the
IndexableGetter there, if that's what it needs).

rtree::pack( srcBegin, srcEnd, packedBegin, packedEnd );

where srcBegin and srcEnd are random access, const_iterators and
std::distance(srcBegin,srcEnd) == std::distance(packedBegin, packedEnd).

Or an output iterator:
rtree::pack<Type, Type, Type>( srcBegin, srcEnd, packedIter );

Then the user would be responsible for allocating the packed container
and the resultant packing what would be passed in as a ctor (or a
factory method that has some sort of name that says "packed"):

auto rtree =
bgi::rtree<ValueType,Partitioner,Indexer>make_packed_rtree(packedBegin,
packedEnd);

// Or, the constructor
auto rtree = bgi::rtree<ValueType,Partitioner,Indexer>(packedBegin,
packedEnd);

That gives the flexibility of allocation without compromising the API.
I think it is cleaner to make a bulk loaded rtree construction more
explicit (vs. insert) and it removes the "surprise" of allocation that
packing requires.

>
> But would it solve your problem?

Yes. My test of 50,000,000 points died because it ran out of memory
(this is a "small" test). It was a big surprise to see that the rtree
needed to allocate outside of the allocator and not use the functor to
derive what it needed. I'm not focused on persistence, that was a means
to an end. My overall goal is to give my users the query ability of
data too large to fit in memory. If the rtree only stored the ValueType
as implied in the documentation and only used the allocater, as implied
by the API...
>
> Alterntively a second allocator could be passed for this temporary
> container, so a second, temporary file could be used. This would
> probably be the most convenient for you, am I right? However this
> would require to complicate the interface (constructors) so I don't
> like it.
I agree that putting yet another allocator in the constructor makes it
even more complex than it should be. Does the rtree really need to keep
the centroids of all of the data while it is packing? Couldn't those be
derived from calling the passed in IndexableGetter? If not, then maybe
getting the centroid is fundamental enough to deserve its own functor,
or a some other way to model that supports indexing, finding the
centroid and equality?

If the packing algorithm really needs the temporary storage, then I
think that the constructor is doing too much and the functionality that
is packing should be made a free function and the rtree's constructor
made simpler.

>
> We should definietly think about it when implementing a "proper"
> persistance support.
Yeah, persistence support is another thing entirely. I'm talking about
the happy case of memory based allocation right now - rtree is not
honoring its API contract with regard to allocators. The harder problem
of persistence is another story, but yes, that would be grand to have
all that too!

I'll take a look to see if I can't figure out if any of my ideas make
any sense. I don't know the domain well, but I do have a lot of C++
experience. I'll let you know either way.

Best,
Tim


Geometry list run by mateusz at loskot.net