
Subject: Re: [geometry] Crash in rtree after growing shared memory, VS2012 x64, Boost 1.55
From: Tim Finer (tim.finer_at_[hidden])
Date: 2014-11-25 13:09:37


Hi Adam,

On 11/25/14, 6:04 AM, Adam Wulkiewicz wrote:
> Hi Tim,
>
> Tim Finer wrote:
>> On 11/21/14, 4:59 PM, Adam Wulkiewicz wrote:
>>> Tim Finer wrote:
>>>> It's been a few months, but I finally got back to trying out the
>>>> bgl rtree with a resizable memory mapped file. Things are looking
>>>> up and I have another question.
>>>>
>>>> I've edited our correspondence down to the essentials:
>>>>
>>>> On 7/26/14, 4:08 AM, Adam Wulkiewicz wrote:
>>>>> Tim Finer wrote:
>>>>>>> On Jul 25, 2014, at 6:16 PM, Adam Wulkiewicz
>>>>>>> <adam.wulkiewicz_at_[hidden]> wrote:
>>>>>>>
>>>>>>> In order to work the way you want the rtree requires an
>>>>>>> allocator which will be able to increase the size of shared
>>>>>>> memory if needed.
>>>>>>> I'm not sure if this is possible in general and/or with
>>>>>>> Interprocess.
>>>>>>> I'll try to play with it in my free time.
>>>>>>> If you were able to come up with something before me, please let
>>>>>>> me know.
>>>>>> Some other options I'm looking at are:
>>>>>>
>>>>>> 1. Scanning all the data first (it isn't in a convenient format,
>>>>>> so I'll need to store it in another mapped file), then allocating
>>>>>> enough space for the rtree.
>>>>>
>>>>> FYI, the packing algorithm done in the constructor (when you're
>>>>> creating the rtree from e.g. a pair of iterators) should construct
>>>>> an rtree having nodes "packed" as tightly as possible, so you could
>>>>> actually predict what'll be the actual number of nodes created, at
>>>>> least for the currently used algorithm. If you're curious, see line
>>>>> 109 of
>>>>> https://github.com/boostorg/geometry/blob/develop/include/boost/geometry/index/detail/rtree/pack_create.hpp.
>>>>> So maybe it'd be a good idea to first load the objects to a
>>>>> vector, calculate their size, then create a shared memory and
>>>>> construct an rtree using pack-create.
>>>>> Though I fear that if the memory could be fragmented this might
>>>>> not work in all cases.
>>>>>
>>>> I want to avoid a first pass and would prefer to trade space for
>>>> time. Is there a way that I could "actually predict what'll be the
>>>> actual number of nodes created" given a live rtree? I'd like to do
>>>> this based upon the currently populated rtree so that I could find
>>>> the worst case scenario and allocate at least that much memory in a
>>>> node insertion wrapper.
>>>>
>>>
>>> I think it could be possible. Since 1.56 or 1.57 (I don't remember
>>> which) no temporary containers should be created using the passed
>>> allocator, only nodes. In the worst case the rtree would have to
>>> create 1 leaf node + N = depth new internal nodes per insert.
>>> Unfortunately there is no official way of checking the depth. You
>>> should be able to do it like this:
>>>
>>> bgi::detail::rtree::utilities::view<rtree_type>(rtree_obj).depth();
>>>
>>> but it's non-official and could be removed/changed in the future.
>> Thank you for the hint - having a non-official function is better
>> than nothing. I'm using 1.55, and depth() is in there.
>>
>> Loading time is a lot faster when I create a memory mapped vector,
>> insert all the points, then create an rtree with the vector's
>> iterators (using the packing you mentioned before). Not only does
>> this generate a smaller rtree, it also is faster at returning queries!
>>
>
> The above is for one call of a single-value insert().
>
> In the case of the packing algorithm you can predict the total number
> of nodes (as I probably wrote earlier), see:
> https://github.com/boostorg/geometry/blob/develop/include/boost/geometry/index/detail/rtree/pack_create.hpp
>
> line 109.
>

I did read that, but the formula wasn't evident to me. I was able to
figure out what I needed to know.
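
For reference, the estimate I ended up using is roughly the following.
It's my own approximation of a tightly packed tree, not the exact
expression from pack_create.hpp, and max_elements here is the rtree's
node capacity (e.g. 16 for bgi::linear<16>):

#include <cstddef>

// Rough node count for a pack-created rtree: every level holds about
// ceil(entries_on_previous_level / max_elements) nodes until a single
// root remains.
std::size_t estimate_packed_node_count(std::size_t value_count,
                                       std::size_t max_elements)
{
    std::size_t nodes = 0;
    std::size_t level = value_count; // entries on the current level
    do
    {
        level = (level + max_elements - 1) / max_elements;
        nodes += level; // nodes created for this level
    }
    while (level > 1);
    return nodes;
}

// e.g. estimate_packed_node_count(1000000, 16) gives a rough total,
// which multiplied by the node size gives a ballpark memory figure.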

>> Your mention about the implementation change brings up a question:
>> how stable is the internal arrangement of nodes from boost version to
>> version?
>>
>> If I create persistent memory mapped rtrees with version 1.55, will
>> allocators from 1.56 or 1.57 still map them correctly into memory?
>> I'm in the middle of implementing all this so my customers can use it
>> as a means of loading very large spatial datasets. If the memory
>> layout changes (in a breaking way) from boost version to boost
>> version, that'll mean a lot more work, which I can live with, but
>> only if it is easy to detect and mitigate robustly.
>>
>
> The safest approach would be to never use previously created shared
> files with a new version of the library. Some bugs might be fixed,
> features added, not only in Geometry but also in Interprocess or any
> other library on which it depends, etc. And we're talking about raw
> data here, stored in a native format. This isn't serialization.
>
> In 1.55 polymorphic nodes were used, and they didn't work properly
> with shared memory. You shouldn't use this version.
> Since 1.56 variant-based nodes are used.
>
> In the future I plan to replace them with a lighter type of node, so
> less memory should be needed, but the representation would change.

I see. I understand that this isn't serialization, but isn't this an
obvious use case? My understanding is that rtrees work really well as
disk based storage, implying memory mapped files (along with samples
that explain how to do this). The data sets my customers have are too
large to fit in memory. It takes several minutes on fast systems (with
SSDs and lots of RAM) to create an rtree of millions of points. I don't
want to make my users do this every time they open a document. My goal
was to optimize the data once into an rtree and then memory map the
rtree many times after that. If Boost.Geometry rearranges the rtree
internally with every release, this makes it much more difficult to
reuse.
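
For context, the pattern I'm after looks roughly like the sketch below.
It's only an illustration under my own assumptions (the file name, the
"rtree" object name, the point type and the bgi::linear<16> parameters
are placeholders), not something taken from the documentation:

#include <cstddef>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>

namespace bi  = boost::interprocess;
namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

typedef bg::model::point<double, 2, bg::cs::cartesian> point_t;
typedef bi::allocator<point_t, bi::managed_mapped_file::segment_manager> alloc_t;
typedef bgi::rtree<point_t, bgi::linear<16>, bgi::indexable<point_t>,
                   bgi::equal_to<point_t>, alloc_t> rtree_t;

// Build once: pack-create the rtree inside the mapped file from an
// existing range of points (e.g. the memory mapped vector I mentioned).
template <typename Iterator>
void build_index(const char* filename, std::size_t bytes,
                 Iterator first, Iterator last)
{
    bi::managed_mapped_file file(bi::create_only, filename, bytes);
    file.construct<rtree_t>("rtree")(
        first, last, bgi::linear<16>(), bgi::indexable<point_t>(),
        bgi::equal_to<point_t>(), alloc_t(file.get_segment_manager()));
}

// Open many times later: map the file again and look the rtree up by
// name, without rebuilding anything.
rtree_t* open_index(bi::managed_mapped_file& file)
{
    return file.find<rtree_t>("rtree").first;
}

This, of course, only works as long as the node layout written by one
boost version is readable by the version doing the find().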

Could you please add a warning about this to the rtree documentation,
in the memory mapped samples? I can't be the only one who can't afford
to regenerate rtrees on the fly and is considering or already
attempting to reuse a memory mapped version.

I see that you actually added experimental serialization support over a
year ago, but it looks like it supports trees that are read completely
into memory, not trees read on demand? I'm more concerned with
persistence than serialization.

Thanks for the heads up about the polymorphic node bug; where is that
documented? If it isn't, that would also be something really helpful to
include. I don't know how boost usually handles this sort of notice,
but documenting it would help.

As much as I like the performance of the bg rtree, I'll probably switch
to an rtree implementation that has persistence as part of its design.
Boost isn't written so that an application can easily use different
versions side by side, which compounds the problem.

>
>>>
>>>>>
>>>>>> 3. Monitor the capacity of the shared file and proactively
>>>>>> resize given a threshold (say 90%). I just thought of this one
>>>>>> and I like it because it sidesteps the exception problem
>>>>>> altogether. The more I think about it, the more I like this one.
>>>>>
>>>>> Yes, this is also worth trying.
>>>>> It seems that there is a get_free_memory() method in
>>>>> managed_shared_memory so the amount of free space could be easily
>>>>> returned.
>>>>> In case the memory is fragmented you could try to allocate some
>>>>> bigger block and see if it succeeds. Then resize the memory if
>>>>> necessary.
>>>>> When the exception is thrown get_free_memory() returns 968 in my
>>>>> test and I'm starting from 1024 * 64.
>>>>> The author of Boost.Interprocess could probably give us some
>>>>> pointers about how big the threshold should be and whether
>>>>> fragmentation should be taken into account.
>>>>>
>>>>
>>>> Hey! This worked. I arbitrarily chose 90% and resized at that
>>>> point, and my initial tests work! What I'd like to do now is
>>>> figure out how much memory it would require. I'm not terribly
>>>> worried about fragmentation (yet), but will get to that after I
>>>> run some tests to find appropriate thresholds.
>>>
>>> So you would probably also need the size of the nodes for the
>>> estimate. Unfortunately it depends on the node type and there is no
>>> official way of obtaining it.
>>>
>>> So again, everything described below is non-official and may be
>>> changed in the future.
>>>
>>> Currently (1.57) Boost.Variant-based nodes are used by default (this
>>> may change in the future; in fact I plan to do it at some point).
>>> Static nodes store varrays/static_vectors; dynamic nodes use
>>> boost::container::vector<>. So in the case of the
>>> container::vector<> the amount of memory would also depend on the
>>> allocation strategy of this container. AFAIR this implementation of
>>> vector<> allocates only as much as it needs, but I'm not sure.
>>>
>>> For Variant-based static nodes you should be able to check the size
>>> like this:
>>> sizeof(
>>> bgi::detail::rtree::utilities::view<rtree_type>::allocators_type::node_allocator_type::value_type
>>> )
>>>
>>> For dynamic nodes it'd be more complicated since you should add the
>>> container::vector<> storage, which in the worst case would be equal
>>> to N*sizeof(Value) for leaves and around N*sizeof(std::pair<Box,
>>> NodePointer>) for internal nodes. AFAIR in the worst case N should
>>> be equal to MAX+1-MIN in the newly created nodes. Box would be of
>>> type model::box<model::point<>> using the same coordinate type and
>>> dimension as the Indexable. NodePointer would be the type returned
>>> by the allocator, which you could get like this:
>>> bgi::detail::rtree::utilities::view<rtree_type>::allocators_type::node_pointer
>>>
>>>
>>>
>>> Because of all of this non-official stuff I'd consider it a
>>> workaround rather than a solution.
>>>
>>> Have you thought about implementing an allocator that manages an
>>> arbitrary number of Boost.Interprocess allocators/shared files? It
>>> could own some number of BI allocators and try to create a block of
>>> memory using them, and if it wasn't able to do so it could create
>>> another file.
>>
>> I'm hoping that my use of a fixed node type means that I can
>> predict the size of the nodes from the space they actually take (by
>> keeping track of the allocator's statistics). So far, the overhead
>> is roughly 8 bytes per node over the size of the data stored when I
>> use the method above (create a vector, use the packing algorithm to
>> create an rtree).
>
> Btw, above I wrote about a single-value insert(). If you're using only
> pack-create, the size could be calculated with more precision, probably
> as you're doing already.
>
> Yes, static-size nodes should always take the same amount of memory
> and you can get the size of a node using the method described above.
>
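
For completeness, my current threshold-based resize wrapper looks
roughly like the sketch below. Again this is only an illustration under
my own assumptions (point values, the "rtree" object name, the 90%
threshold and the doubling growth policy are all placeholders); the
important part is that managed_mapped_file::grow() requires the file to
be unmapped, so the wrapper closes the mapping, grows the file and then
re-finds the rtree:

#include <cstddef>
#include <memory>
#include <string>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>

namespace bi  = boost::interprocess;
namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

typedef bg::model::point<double, 2, bg::cs::cartesian> point_t;
typedef bi::allocator<point_t, bi::managed_mapped_file::segment_manager> alloc_t;
typedef bgi::rtree<point_t, bgi::linear<16>, bgi::indexable<point_t>,
                   bgi::equal_to<point_t>, alloc_t> rtree_t;

class growing_rtree
{
public:
    growing_rtree(const std::string& filename, std::size_t initial_bytes)
        : m_filename(filename)
        , m_file(new bi::managed_mapped_file(bi::open_or_create,
                                             filename.c_str(), initial_bytes))
    {
        attach();
    }

    void insert(point_t const& p)
    {
        // Grow proactively once less than ~10% of the segment is free.
        // A tighter bound could use depth() and the node size instead.
        if (m_file->get_free_memory() < m_file->get_size() / 10)
            grow(m_file->get_size()); // roughly double the file
        m_rtree->insert(p);
    }

    rtree_t& index() { return *m_rtree; }

private:
    void attach()
    {
        // Find the rtree in the mapped segment, or construct it on first use.
        m_rtree = m_file->find_or_construct<rtree_t>("rtree")(
            rtree_t::parameters_type(), bgi::indexable<point_t>(),
            bgi::equal_to<point_t>(), alloc_t(m_file->get_segment_manager()));
    }

    void grow(std::size_t extra_bytes)
    {
        // grow() works on an unmapped file: close the mapping, extend the
        // file on disk, re-map it and re-find the rtree (its address and
        // the segment's base address may both have changed).
        m_file.reset();
        bi::managed_mapped_file::grow(m_filename.c_str(), extra_bytes);
        m_file.reset(new bi::managed_mapped_file(bi::open_only,
                                                 m_filename.c_str()));
        attach();
    }

    std::string m_filename;
    std::unique_ptr<bi::managed_mapped_file> m_file;
    rtree_t* m_rtree;
};
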
Thank You,
Tim

