|
Geometry : |
Subject: Re: [geometry] Storage allocator questions
From: Henry Roeland (henry_at_[hidden])
Date: 2014-11-30 06:15:57
Hi Tim,
> On 29 nov. 2014, at 20:03, Tim Finer <tim.finer_at_[hidden] <mailto:tim.finer_at_[hidden]>> wrote:
>
> Hi Henry,
>
> On 11/29/14, 10:48 AM, Henry Roeland wrote:
>> Hi Adam,
>>
>> It would be greet if I can focus on storage allocator and leave the rtree part to you or somebody else.
>>
>> I hope I can still ask you guys questions concerning this storage allocator? I have some global Ideas and probably need help both on design level and implementation.
>>
>>
>> Questions so far:
>>
>> 1. Is paging the only mechanism to get memory growth under control for an rtree? Its the only one I can think of, but maybe there are other ways to explore?
>>
>>
>> 2. Is there any way an rtree can be read-only? Or set read-only after (bulk)insert?
>>
>>
>> 3. One problem/challenge is how to get and keep region(box) coordinates divided over the page nodes. At least thats my understanding of an (r)tree. You need the region coordinates as fast lookup ID to know if a query should dive into the page or its sub pages or not.
>> Crazy idea: Maybe the allocator should manage an rtree of pages with references/pointers to the real tree nodes?
>>
>> Questions:
>> Is my understanding of data lookup right? So coordinates determine the location of the data inside the RTree?
>> If this is true does this then mean that all Pages (Loaded or unloaded) must be known including its region coordinates?
>> This in order to get/keep the tree balanced and know which page must be loaded (by coordinate query)?
>>
>>
>> Lets discuss the different actions an storage allocator can/must handle:
>> Load in Page data on request by RTree
>> Load" can mean anything: From File, From DB, From MessageQueue,
>> Request" means query hits coordinate region that matches the Page coordinates
>> Bulk Load multiple pages at once???
>>
>> Unload Page data on request by Storage Manager itself(?)
>> Unload can mean:
>> To /dev/null :-) Only useful when read-only rtree
>> To File: Memory mapped or otherwise
>> To DB: Save data to (spatial) database
>> To MessageQueue
>> ???
>> Request can mean:
>> Smart_pointer(s) say that nodes inside page are not used anymore
>> Smart_pointer(s) say that page is not used anymore
>> Amount of usage(hit count) is small
>> ???
>> Bulk Unload multiple pages at once???
>>
>> NOTE: The storage allocator must, as Adam already pointed out, deliver an interface to make above points possible. But not implement them in anyway.
>>
>> Other questions/points that come to my mind are:
>> Must the storage allocator always store? What about transactions between in memory (RTree) and Persistence storage?
>> State-full allocator: Who owns the nodes and the data inside it? I always thought of an allocator like a factory that does not own the data
>> When data is owned by another STL(like) container then IMHO a storage allocator has no use. This because the storage allocator then does not own the data and has no means to free/unload it.
>>
>>
>> Above is going top down by requirements/features but code-wise I will try to handle in my next e-mail.
>>
>> Thanks for your time,
>>
>> Kind regards,
>>
>> Henry Roeland
>>
>>
>>> On 26 nov. 2014, at 22:12, Adam Wulkiewicz <adam.wulkiewicz_at_[hidden] <mailto:adam.wulkiewicz_at_[hidden]>> wrote:
>>>
>>> Hi Henry,
>>>
>>> Henry Roeland wrote:
>>>>
>>>>> On 26 nov. 2014, at 17:48, Adam Wulkiewicz <adam.wulkiewicz_at_[hidden] <mailto:adam.wulkiewicz_at_[hidden]>> wrote:
>>>>>
>>>>> Henry Roeland wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> First let me introduce myself: my name is Henry Roeland 37 years old and currently working as software engineer for a company that delivers maps to car companies like BMW and VW.
>>>>>> I'm working in the testtools team that is responsible for a map viewer and maptest batch tool.
>>>>>> Technics that we use are C/C++, Sqlite, spatialite and of corse boost.
>>>>>>
>>>>>> Personally I have multiple years of C++ experience and busy getting up to speed with geometry technics in 2D. NeXT to that just started with studying advanced datastructures and algoritms like b+tree and r*tree.
>>>>>>
>>>>>> Currently I see your rtree as a perfect candidate for in memory container/index/cache in our map interface. As I wrote in my previous email the feature that is not supported yet is paging. How can I help to accomplish this feature? Are there specs? Or am I the only one in favior of this feature?
>>>>> Certainly not the only one. People are asking about it from time to time, using Interprocess' mapped file as a replacement. You're the only one who is willing to help with the implementation. I greately appreciate it.
>>>>>
>>>>>
>>>>> As I mentioned before I was thinking about a stateful allocator/storage/manager handling file or files/pages, cacheing etc. This allocator would return pointers which when dereferenced would force the allocator to load and cache data. Then, there would be a mechanism to tell the allocator that some pointers/data aren't used anymore. It could be done explicitly by calling some function on allocator like:
>>>>>
>>>>> persistent_allocator alloc(...);
>>>>> rtree<..., persistent_allocator> rt(..., alloc);
>>>>>
>>>>> rt.insert(...);
>>>>> rt.insert(...);
>>>>> rt.insert(...);
>>>>> alloc.flush();
>>>>>
>>>> I dont know about an rtree but from my homework:-) concerning b-trees I understood that pages can be handled as nodes themselves. Pages can be split and contain multiple nodes depending on size. If we could create a stable interface for Page(Nodes) then this interface can become public leaving the Node interface hidden???
>>>> It can also have other advantages to alloc a group of nodes (Page) together in terms of memory alignment(Your dont need a smart pointer for every node only for the page?), Also in terms of (de)serialisation and bulk load handling a group of nodes would IMHO be better (if its possible).
>>>
>>> Of course you're right, nodes could (or even should) be grouped in pages, loaded together etc. Also we should probably group close nodes together in one page (e.g. parents with children), allocator hint could be used for this. But I was thinking about leaving the desicion how to store nodes to the allocator/storage. As you said one could use big pages, one could store 1 node per file, or a database etc. And still somehow a page or some other storage or Archive must know what data should be saved/loaded per node. In my scenario this would just be implemented in a Serialization-like way. Higher level aspects like pages would be closed in allocator/storage. Of course assuming that this could be done this way but I feel that it could.
>>>
>>> I think that a pointer still would have to store some ID of a node inside a page to identify an exact placement of the data. Or at least some global ID somehow indexed to access a specific node in a specific page.
>>>
>>>>
>>>>> The next step could be adding of some hooks or traits for allocators which would be called by the rtree internally on the end of some operation, e.g. insert. So this wouldn't be a C++ allocator anymore since it'd have additional methods.
>>>>>
>>>>> To avoid storing a reference to allocator/storage in those smart pointers the rtree could notify the allocator/storage when a pointer is dereferenced and data is required. So the responsibility for the notification of allocator/storage would be moved from a pointer to the rtree.
>>>>>
>>>> The Page Node can administrate its size and its free size and have a smart_pointer telling when it was last used and dependency_checker /unload checker with responsibility for notification. Maybe we can use Boost Signals2 for notification?
>>>>
>>>>
>>>> Together with a Page/Storage manager Im thinking about the following points:
>>>> - Policy for when to load/unload. traits as you point it out.
>>>> - Size of rtree must be known at all times. If some limit is reached start unloading (allocator can tell the size?) to bottom pages that are less used?
>>>> - 25%/50%/75% percent of the top of the tree must remain in memory for speed???
>>>> - define 3 actions/states for a page: loaded, unloaded, unloaded but cached as memory map.????
>>>
>>> So what you're describing here is how a paging allocator/storage could work. Sure if the storage was notified by the rtree about the performed actions it'd be able to do whatever it needs, cache pages, store some kind of usage counter to know which pages are used most often, support various policies, etc.
>>>
>>> I was describing more low level stuff that would have to be implemented on the rtree side. Basically 2 things notifications for actions and an interface for serialization of nodes and rtree header. Or would you prefer if I handled it, that you could focus on the implementation of a specific allocator/storage?
>>>
>>> There is also another thing that is mentioned in another thread (started by Tim). There should be some way of creation non-rtree, temporary pages, e.g. for data stored in a vector or some other container designed specifically for this purpose. But this could be handled later.
>>>
>>>>
>>>>> Finally the rtree could notify the allocator/storage each time the data owned by a pointer is no longer needed but this probably wouldn't change anything. Furthermore modifying and releasing the nodes before an operation is finished wouldn't be safe in case when an exception was thrown. In fact, this allocator should keep a copy of nodes in memory during a whole operation and then at the end of the operation save the data into a persistent storage.
>>>>>
>>>>>
>>>>> The next thing is how to serialize data from and to node. We should ask ourselves should we allow the users to implement other persistent storage variants on their own. If the answer was no, then we could just internally in the persistent allocator use the internals of the rtree. But if we wanted to allow it (which IMO should be done) the most straightforward way would be to expose the interface used by the rtree internally to access node's elements. Exposing it would mean that we couldn't change it in the future, so I'd prefer to avoid it. Instead, we could mimic or directly use Boost.Serialization interface for Nodes. In this scenario a specific Node would know what data must be saved and loaded and it'd be able to save or load to/from some generic Archive. Depending on the needs this could be some temporary 1-node Archive gathering data for later or a target Archive capable to serialize data on the fly, it'd depend on an allocator/storage.
>>>>>
>>>>> This way we could also support versioning on a node level, the same way how it's done in Serialization. So changes would have to be done on a nodes level not in the user-defined allocator. An example could be an implementation of a rtree variant storing additional data in nodes (hilbert r-tree) or additional types of nodes (PRtree). Also arbitrary user-defined Values would be serialized the same way (using Serialization or familiar Serialization-like interface).
>>>>>
>>>>> And this way we'd also support Serialization in one go.
>>>>>
>>>>>
>>>>> We'd probably need some file header with the parameters of an rtree both in persistent storage and in serialization support. Similar as with nodes, an rtree could know how to load/save the header from/to some Archive. The rtree should e.g. check if it's possible to use persistent storage and load data, e.g. if it wasn't created using incompatible parameters, etc.
>>>>>
>>>>>
>>>> How is it with backwards compatibility? I can imagine that Storage work can break something already running?
>>>
>>> I think that this won't be problematic.
>>>
>>>> If Im talking rubbish just tell me I can handle it:-)
>>>>
>>>
>>> All ideas are valuable :)
>>>
>>>> Thanks for beneath info to getting me started. I will have to read it a few times to understand it all.
>>>
>>> If you have any questions just ask. If I handled it on the rtree side and you was working on the allocator/storage, you probably wouldn't even be forced to know all of this internal stuff. Not that I have something against it :)
>>>
>>> Regards,
>>> Adam
>
> I have a request that the allocator be the sole source of allocation. One of my problems with the current implementation is that one of the constructors creates a temporary vector of centroids as a source for a packing algorithm. My understanding is that the allocator needs to be generic enough to offer up a temporary buffer (something similar to std::get_temporary_buffer) so that the packing algorithm can take advantage of it without going to the heap.
>
> Best,
> Tim
> _______________________________________________
> Geometry mailing list
> Geometry_at_[hidden] <mailto:Geometry_at_[hidden]>
> http://lists.boost.org/mailman/listinfo.cgi/geometry <http://lists.boost.org/mailman/listinfo.cgi/geometry>
Do you mean that the storage allocator must also support creating temporary buffer(s)? And what about storing this temporary buffers?
I will take this into account. Thanks for pointing this out.
Kind regards,
Henry
Geometry list run by mateusz at loskot.net