Boost logo

Geometry :

Subject: Re: [geometry] Storage allocator questions
From: Adam Wulkiewicz (adam.wulkiewicz_at_[hidden])
Date: 2014-12-01 11:20:13


Hi Henry,

Henry Roeland wrote:
> It would be greet if I can focus on storage allocator and leave the
> rtree part to you or somebody else.
>

Ok, let's do it this way. FYI, I won't have time this week to work on
this. But at least I'll answer on this email briefly.

> I hope I can still ask you guys questions concerning this storage
> allocator? I have some global Idea's and probably need help both on
> design level and implementation.
>
>
> Questions so far:
>
> 1. Is paging the only mechanism to get memory growth under control for
> an rtree? Its the only one I can think of, but maybe there are other
> ways to explore?
>

For the first version I think that our design should be as generic as
possible:
- the Storage concept should be an abstraction of a persistent storage,
just like an Allocator is an abstraction of "directly-accessible" memory.
- the rtree would request something or notify the storage about some
events but it wouldn't have to know anything about the internal
structure, techniques, heuristics, etc. E.g. the rtree wouldn't know
anything about paging.
- the interfaces should probably be as compact as possible, e.g. as I
said before, I think that in addition to the Allocator's interface we'd
need 2 functions:
   * storage_traits<StorageOrAllocator>::get_pointer(ptr) - notify
the Storage that data pointed by a pointer is needed
   * storage_traits<StorageOrAllocator>::release_pointers() - notify the
Storage that pointers aren't used anymore
- the storage probably shouldn't write the data directly to persistent
part, instead it should keep the modified data cached and then write
them all in one go. It could depend on some heuristic, the Storage could
write changes when release_pointers() was called or it could define
flush() method which could be called explicitly by the user. Still, the
changes of persistent data should be done only if all pointers was
released. At all times the storage could choose to keep the data cached.
It could choose to unload all of the nodes, or only some less frequently
used ones.

Is it reasonable?

>
> 2. Is there any way an rtree can be read-only? Or set read-only after
> (bulk)insert?
>

It could be done either in run- or in compile-time.
The most obvious way would be to throw a run-time exception in the
get_pointer() after some flag (rtree created) was set.
Another possibility could be to disable rtree mutable methods like
insert() and remove() in compile-time if some specific parameters or
Storage/Allocator was passed.
Do you have some specific use-case in mind?

>
> 3. One problem/challenge is how to get and keep region(box)
> coordinates "divided" over the page nodes. At least thats my
> understanding of an (r)tree. You need the region coordinates as fast
> lookup ID to know if a query should "dive" into the page or its sub
> pages or not.
> Crazy idea: Maybe the allocator should manage an rtree of pages with
> references/pointers to the "real" tree nodes?
>
> Questions:
>
> * Is my understanding of data lookup right? So coordinates determine
> the location of the data inside the RTree?
> * If this is true does this then mean that all Pages (Loaded or
> unloaded) must be known including its region coordinates?
> * This in order to get/keep the tree balanced and know which page
> must be loaded (by coordinate query)?
>

Each node internal node (in a classic R-tree) is a container of (Box,
Ptr) pairs. During any traversal (insert, query, etc.) the rtree
probably should load the whole node to have access to all of the Boxes
of children regions. AFAIR all traversing algorithms access all Boxes.
Then it chooses which children should be traversed and it traverses into
the next level using the corresponding pointers.

>
> Lets discuss the different actions an storage allocator can/must handle:
>
> * Load in Page data on request by RTree
> o "Load" can mean anything: From File, From DB, From
> MessageQueue,...
>
So the load would potentially be done in the call to get_pointer().
AFAIU the Storage would have to check if the underlying data (identified
by Ptr storing some ID or handle) isn't already loaded/cached. If it was
it'd just return the Ptr, if not then it'd:
- load a file,
- (maybe somewhere cache the raw file),
- create a node from loaded data and keep it,
- return the pointer.

So in my scenario (release_pointers()) it'd have to keep some container
of returned pointers and when release_pointers(), flush() was called
then just iterate over them and save changed nodes or create new pages,
etc. It should probably also be possible to search the specific pointer
in this container, so it should probably be map or unordered_map, or
sorted vector, etc. Or do you have other ideas?

E.g. when construct() was called the Storage probably shouldn't create a
Page or other persistent data, it should be created later, e.g. on
flush(). The same with destroy().

Btw, the Storage/Allocator would have to support rebinding and creation
of various types, not only nodes. It'd have to know if a type is a node
or some other type. So some types would be handled differently than
other types. I'm not sure if this is a good or a bad thing. I guess it's
possible even with std::allocator<> to specialize it for some specific type.

> o "Request" means query hits coordinate region that matches the
> Page coordinates
>
This isn't needed, the rtree would know exactly which child node must be
accessed becaused the Ptr is bound with the region in the internal node.
>
> * Bulk Load multiple pages at once???
>
>
Hmm, it could be done and potentially usable. Though it'd complicate the
storage_traits (1 additional methog/overload). But do you think that we
could gain anything from that? I mean, various pages would have to be
loaded and cached anyway. This could be added later if needed.
>
> * Unload Page data on request by Storage Manager itself(?)
> o "Unload" can mean:
> + To /dev/null :-) Only useful when read-only rtree
> + To File: Memory mapped or otherwise
> + To DB: Save data to (spatial) database
> + To MessageQueue
> + ???
> o "Request" can mean:
> + Smart_pointer(s) say that nodes inside page are not used
> anymore
> + Smart_pointer(s) say that page is not used anymore
> + Amount of usage(hit count) is small
> + ???
> * Bulk Unload multiple pages at once???
>
>

So in my scenario the rtree would notify the Storage that all pointers
are released and that the Storage may do whatever it like with them. So
it could save changed nodes, unload some of them, etc.

Hmm, the Storage also probably should know that a node was modified, or
that pointers are released after mutable operation like insert() or
remove(). So e.g. in the first version after an insert() the Storage
could just save all nodes.

> NOTE: The storage allocator must, as Adam already pointed out, deliver
> an interface to make above points possible. But not implement them in
> anyway.
>

So as mentioned above I was thinking about storage_traits which could be
specialized for some StorageOrAllocator. Some member functions would be
empty for Allocators but for Storage would have to be implemented. They
would probably call some Storage's methods which could be called
whatever the implementor liked.

> Other questions/points that come to my mind are:
>
> * Must the storage allocator always store? What about transactions
> between in memory (RTree) and Persistence storage?
> * State-full allocator: Who owns the nodes and the data inside it? I
> always thought of an allocator like a factory that does not own
> the data...
> * When data is owned by another STL(like) container then IMHO a
> storage allocator has no use. This because the storage allocator
> then does not own the data and has no means to free/unload it.
>
>

I see 2 ways how the ownership could be implemented:
1. Similar to Boost.Interprocess where there is a memory Manager owning
the data and AFAIU an Allocator only wrapping some reference to this
Manager. So AFAIU (I didn't check the code) the Allocator is only an
interface between a Container and a Manager. This simplifies the copying
of an Allocator, rebinding, etc.
2. An Allocator which is in the same time a Manager or rather owns a
Manager, so also owns the data. This'd probably require that the
Allocator would have to store a shared_pointer to some Manager
internally. An instance of a Manager would be created under the hood and
a pointer to it would be copied automatically.

1 is more clear but it'd require to design not only the interface for a
Storage/Allocator but also for a Manager, like in the case of
Interprocess. 2 would only require to design the interface of an
Allocator/Storage. If we choose 1 we could divide the logic into 2
Concepts, the Allocator could implement the in-memory-part (pointers,
memory allocation, construction, destruction, etc.) and the Manager the
persistent part (some file IDs or handles, saving and loading files,
nodes serialization, etc.). So as in the case of Interprocess we could
have 1 Allocator and could have many Managers.

In general I propose to work iteratively, that is at the beginning
support minimal set of required operations in the Storage/Allocator to
be able to e.g. insert() values and perform a query() in the most
simple, probably inefficient way, but with a simple and elegant
interface. And then optimize operations and extend the interface. Or
would you prefer to design the whole Storage/Allocator theoreticaly?

Regards,
Adam



Geometry list run by mateusz at loskot.net