Boost logo

Geometry :

Subject: Re: [geometry] Storage allocator questions
From: Tim Finer (tim.finer_at_[hidden])
Date: 2014-11-29 14:03:46


Hi Henry,

On 11/29/14, 10:48 AM, Henry Roeland wrote:
> Hi Adam,
>
> It would be greet if I can focus on storage allocator and leave the
> rtree part to you or somebody else.
>
> I hope I can still ask you guys questions concerning this storage
> allocator? I have some global Idea's and probably need help both on
> design level and implementation.
>
>
> Questions so far:
>
> 1. Is paging the only mechanism to get memory growth under control for
> an rtree? Its the only one I can think of, but maybe there are other
> ways to explore?
>
>
> 2. Is there any way an rtree can be read-only? Or set read-only after
> (bulk)insert?
>
>
> 3. One problem/challenge is how to get and keep region(box)
> coordinates "divided" over the page nodes. At least thats my
> understanding of an (r)tree. You need the region coordinates as fast
> lookup ID to know if a query should "dive" into the page or its sub
> pages or not.
> Crazy idea: Maybe the allocator should manage an rtree of pages with
> references/pointers to the "real" tree nodes?
>
> Questions:
>
> * Is my understanding of data lookup right? So coordinates determine
> the location of the data inside the RTree?
> * If this is true does this then mean that all Pages (Loaded or
> unloaded) must be known including its region coordinates?
> * This in order to get/keep the tree balanced and know which page
> must be loaded (by coordinate query)?
>
>
> Lets discuss the different actions an storage allocator can/must handle:
>
> * Load in Page data on request by RTree
> o "Load" can mean anything: From File, From DB, From
> MessageQueue,...
> o "Request" means query hits coordinate region that matches the
> Page coordinates
> * Bulk Load multiple pages at once???
>
>
> * Unload Page data on request by Storage Manager itself(?)
> o "Unload" can mean:
> + To /dev/null :-) Only useful when read-only rtree
> + To File: Memory mapped or otherwise
> + To DB: Save data to (spatial) database
> + To MessageQueue
> + ???
> o "Request" can mean:
> + Smart_pointer(s) say that nodes inside page are not used
> anymore
> + Smart_pointer(s) say that page is not used anymore
> + Amount of usage(hit count) is small
> + ???
> * Bulk Unload multiple pages at once???
>
>
> NOTE: The storage allocator must, as Adam already pointed out, deliver
> an interface to make above points possible. But not implement them in
> anyway.
>
> Other questions/points that come to my mind are:
>
> * Must the storage allocator always store? What about transactions
> between in memory (RTree) and Persistence storage?
> * State-full allocator: Who owns the nodes and the data inside it? I
> always thought of an allocator like a factory that does not own
> the data...
> * When data is owned by another STL(like) container then IMHO a
> storage allocator has no use. This because the storage allocator
> then does not own the data and has no means to free/unload it.
>
>
>
> Above is going top down by requirements/features but code-wise I will
> try to handle in my next e-mail.
>
> Thanks for your time,
>
> Kind regards,
>
> Henry Roeland
>
>
>> On 26 nov. 2014, at 22:12, Adam Wulkiewicz <adam.wulkiewicz_at_[hidden]
>> <mailto:adam.wulkiewicz_at_[hidden]>> wrote:
>>
>> Hi Henry,
>>
>> Henry Roeland wrote:
>>>
>>>> On 26 nov. 2014, at 17:48, Adam Wulkiewicz
>>>> <adam.wulkiewicz_at_[hidden] <mailto:adam.wulkiewicz_at_[hidden]>> wrote:
>>>>
>>>> Henry Roeland wrote:
>>>>> Dear all,
>>>>>
>>>>> First let me introduce myself: my name is Henry Roeland 37 years
>>>>> old and currently working as software engineer for a company that
>>>>> delivers maps to car companies like BMW and VW.
>>>>> I'm working in the testtools team that is responsible for a map
>>>>> viewer and maptest batch tool.
>>>>> Technics that we use are C/C++, Sqlite, spatialite and of corse boost.
>>>>>
>>>>> Personally I have multiple years of C++ experience and busy
>>>>> getting up to speed with geometry technics in 2D. NeXT to that
>>>>> just started with studying advanced datastructures and algoritms
>>>>> like b+tree and r*tree.
>>>>>
>>>>> Currently I see your rtree as a perfect candidate for in memory
>>>>> container/index/cache in our map interface. As I wrote in my
>>>>> previous email the feature that is not supported yet is paging.
>>>>> How can I help to accomplish this feature? Are there specs? Or am
>>>>> I the only one in favior of this feature?
>>>> Certainly not the only one. People are asking about it from time to
>>>> time, using Interprocess' mapped file as a replacement. You're the
>>>> only one who is willing to help with the implementation. I greately
>>>> appreciate it.
>>>>
>>>>
>>>> As I mentioned before I was thinking about a stateful
>>>> allocator/storage/manager handling file or files/pages, cacheing
>>>> etc. This allocator would return pointers which when dereferenced
>>>> would force the allocator to load and cache data. Then, there would
>>>> be a mechanism to tell the allocator that some pointers/data aren't
>>>> used anymore. It could be done explicitly by calling some function
>>>> on allocator like:
>>>>
>>>> persistent_allocator alloc(...);
>>>> rtree<..., persistent_allocator> rt(..., alloc);
>>>>
>>>> rt.insert(...);
>>>> rt.insert(...);
>>>> rt.insert(...);
>>>> alloc.flush();
>>>>
>>> I don't know about an rtree but from my homework:-) concerning
>>> b-trees I understood that pages can be handled as nodes themselves.
>>> Pages can be split and contain multiple nodes depending on size. If
>>> we could create a stable interface for Page(Nodes) then this
>>> interface can become public leaving the Node interface hidden???
>>> It can also have other advantages to alloc a group of nodes (Page)
>>> together in terms of memory alignment(Your don't need a smart
>>> pointer for every node only for the page?), Also in terms of
>>> (de)serialisation and bulk load handling a group of nodes would IMHO
>>> be better (if its possible).
>>
>> Of course you're right, nodes could (or even should) be grouped in
>> pages, loaded together etc. Also we should probably group close nodes
>> together in one page (e.g. parents with children), allocator hint
>> could be used for this. But I was thinking about leaving the desicion
>> how to store nodes to the allocator/storage. As you said one could
>> use big pages, one could store 1 node per file, or a database etc.
>> And still somehow a page or some other storage or Archive must know
>> what data should be saved/loaded per node. In my scenario this would
>> just be implemented in a Serialization-like way. Higher level aspects
>> like pages would be closed in allocator/storage. Of course assuming
>> that this could be done this way but I feel that it could.
>>
>> I think that a pointer still would have to store some ID of a node
>> inside a page to identify an exact placement of the data. Or at least
>> some global ID somehow indexed to access a specific node in a
>> specific page.
>>
>>>
>>>> The next step could be adding of some hooks or traits for
>>>> allocators which would be called by the rtree internally on the end
>>>> of some operation, e.g. insert. So this wouldn't be a C++ allocator
>>>> anymore since it'd have additional methods.
>>>>
>>>> To avoid storing a reference to allocator/storage in those smart
>>>> pointers the rtree could notify the allocator/storage when a
>>>> pointer is dereferenced and data is required. So the responsibility
>>>> for the notification of allocator/storage would be moved from a
>>>> pointer to the rtree.
>>>>
>>> The Page Node can administrate its size and its free size and have a
>>> smart_pointer telling when it was last used and dependency_checker
>>> /unload checker with responsibility for notification. Maybe we can
>>> use Boost Signals2 for notification?
>>>
>>>
>>> Together with a Page/Storage manager I'm thinking about the
>>> following points:
>>> - Policy for when to load/unload. traits as you point it out.
>>> - Size of rtree must be known at all times. If some limit is reached
>>> start unloading (allocator can tell the size?) to bottom pages that
>>> are less used?
>>> - 25%/50%/75% percent of the top of the tree must remain in memory
>>> for speed???
>>> - define 3 actions/states for a page: loaded, unloaded, unloaded but
>>> cached as memory map.????
>>
>> So what you're describing here is how a paging allocator/storage
>> could work. Sure if the storage was notified by the rtree about the
>> performed actions it'd be able to do whatever it needs, cache pages,
>> store some kind of usage counter to know which pages are used most
>> often, support various policies, etc.
>>
>> I was describing more low level stuff that would have to be
>> implemented on the rtree side. Basically 2 things notifications for
>> actions and an interface for serialization of nodes and rtree header.
>> Or would you prefer if I handled it, that you could focus on the
>> implementation of a specific allocator/storage?
>>
>> There is also another thing that is mentioned in another thread
>> (started by Tim). There should be some way of creation non-rtree,
>> temporary pages, e.g. for data stored in a vector or some other
>> container designed specifically for this purpose. But this could be
>> handled later.
>>
>>>
>>>> Finally the rtree could notify the allocator/storage each time the
>>>> data owned by a pointer is no longer needed but this probably
>>>> wouldn't change anything. Furthermore modifying and releasing the
>>>> nodes before an operation is finished wouldn't be safe in case when
>>>> an exception was thrown. In fact, this allocator should keep a copy
>>>> of nodes in memory during a whole operation and then at the end of
>>>> the operation save the data into a persistent storage.
>>>>
>>>>
>>>> The next thing is how to serialize data from and to node. We should
>>>> ask ourselves should we allow the users to implement other
>>>> persistent storage variants on their own. If the answer was no,
>>>> then we could just internally in the persistent allocator use the
>>>> internals of the rtree. But if we wanted to allow it (which IMO
>>>> should be done) the most straightforward way would be to expose the
>>>> interface used by the rtree internally to access node's elements.
>>>> Exposing it would mean that we couldn't change it in the future, so
>>>> I'd prefer to avoid it. Instead, we could mimic or directly use
>>>> Boost.Serialization interface for Nodes. In this scenario a
>>>> specific Node would know what data must be saved and loaded and
>>>> it'd be able to save or load to/from some generic Archive.
>>>> Depending on the needs this could be some temporary 1-node Archive
>>>> gathering data for later or a target Archive capable to serialize
>>>> data on the fly, it'd depend on an allocator/storage.
>>>>
>>>> This way we could also support versioning on a node level, the same
>>>> way how it's done in Serialization. So changes would have to be
>>>> done on a nodes level not in the user-defined allocator. An example
>>>> could be an implementation of a rtree variant storing additional
>>>> data in nodes (hilbert r-tree) or additional types of nodes
>>>> (PRtree). Also arbitrary user-defined Values would be serialized
>>>> the same way (using Serialization or familiar Serialization-like
>>>> interface).
>>>>
>>>> And this way we'd also support Serialization in one go.
>>>>
>>>>
>>>> We'd probably need some file header with the parameters of an rtree
>>>> both in persistent storage and in serialization support. Similar as
>>>> with nodes, an rtree could know how to load/save the header from/to
>>>> some Archive. The rtree should e.g. check if it's possible to use
>>>> persistent storage and load data, e.g. if it wasn't created using
>>>> incompatible parameters, etc.
>>>>
>>>>
>>> How is it with backwards compatibility? I can imagine that Storage
>>> work can break something already running?
>>
>> I think that this won't be problematic.
>>
>>> If I'm talking rubbish just tell me I can handle it:-)
>>>
>>
>> All ideas are valuable :)
>>
>>> Thanks for beneath info to getting me started. I will have to read
>>> it a few times to understand it all.
>>
>> If you have any questions just ask. If I handled it on the rtree side
>> and you was working on the allocator/storage, you probably wouldn't
>> even be forced to know all of this internal stuff. Not that I have
>> something against it :)
>>
>> Regards,
>> Adam

I have a request that the allocator be the sole source of allocation.
One of my problems with the current implementation is that one of the
constructors creates a temporary vector of centroids as a source for a
packing algorithm. My understanding is that the allocator needs to be
generic enough to offer up a temporary buffer (something similar to
std::get_temporary_buffer) so that the packing algorithm can take
advantage of it without going to the heap.

Best,
Tim



Geometry list run by mateusz at loskot.net