Boost logo

Geometry :

Subject: Re: [geometry] Storage allocator questions
From: Adam Wulkiewicz (adam.wulkiewicz_at_[hidden])
Date: 2014-12-08 10:12:31

Hi Henry,

Henry Roeland wrote:
> On 1 dec. 2014, at 17:20, Adam Wulkiewicz <adam.wulkiewicz_at_[hidden]
> <mailto:adam.wulkiewicz_at_[hidden]>> wrote:
>> Henry Roeland wrote:
>>> It would be greet if I can focus on storage allocator and leave the
>>> rtree part to you or somebody else.
>> Ok, let's do it this way. FYI, I won't have time this week to work on
>> this. But at least I'll answer on this email briefly.
> No problem.
Ok, we could get back to this if you wanted.

>>> I hope I can still ask you guys questions concerning this storage
>>> allocator? I have some global Idea's and probably need help both on
>>> design level and implementation.
>>> Questions so far:
>>> 1. Is paging the only mechanism to get memory growth under control
>>> for an rtree? Its the only one I can think of, but maybe there are
>>> other ways to explore?
>> For the first version I think that our design should be as generic as
>> possible:
> One vote for generic! :-)
>> - the Storage concept should be an abstraction of a persistent
>> storage, just like an Allocator is an abstraction of
>> "directly-accessible" memory.
>> - the rtree would request something or notify the storage about some
>> events but it wouldn't have to know anything about the internal
>> structure, techniques, heuristics, etc. E.g. the rtree wouldn't know
>> anything about paging.
> OK. But what about the other way around? Should the Storage/Allocator
> not know how the container is build up? This in order to choose
> efficiently how to store its data?

What things should be known by the allocator/storage? I'm guessing that
if at some point some additional info was needed then it would be easy
to add another function or definition to the storage_traits, so the
storage would know everything it needs.

>> - the interfaces should probably be as compact as possible, e.g. as I
>> said before, I think that in addition to the Allocator's interface
>> we'd need 2 functions:
>> * storage_traits<StorageOrAllocator>::get_pointer(ptr) - notify the
>> Storage that data pointed by a pointer is needed
>> * storage_traits<StorageOrAllocator>::release_pointers() - notify
>> the Storage that pointers aren't used anymore
>> - the storage probably shouldn't write the data directly to
>> persistent part, instead it should keep the modified data cached and
>> then write them all in one go. It could depend on some heuristic, the
>> Storage could write changes when release_pointers() was called or it
>> could define flush() method which could be called explicitly by the
>> user. Still, the changes of persistent data should be done only if
>> all pointers was released. At all times the storage could choose to
>> keep the data cached. It could choose to unload all of the nodes, or
>> only some less frequently used ones.
>> Is it reasonable?
> Yes it sounds reasonable. But are we not creating a generic storage
> allocator which also can be used on e.g. std::vector or other stl
> containers? I have no problem with this but I cannot imagine that we
> are the first...

To be honest, I haven't done an extensive research regarding the
existing solutions. However every solutions I saw are tightly bound with
the container itself. I think that in most cases the container that is
implemented is either in-memory or persistent. But the fact that I
haven't seen a persistent storage abstraction doesn't mean that it
doesn't exist. Though I'm not sure if it would be reasonable in all
cases. So sure, additional research could be done.

For the rtree we only need a storage capable to allocate the memory for
nodes. And we also need a way to create a big temporary storage,
probably divided into several files.

Though it would probably be possible to implement a persistent allocator
for std::vector it wouldn't be very efficient because a vector
reallocates the memory by creating another block and copying all of the
elements. Handling this problem could allow us to implement a heap-like
kd-tree storing the nodes in a contigeous storage. But then we'd be
forced to deal with additional problems like cacheing chunks of this big
persistent memory block, prefetching, etc.

> Other questions/points that popup:
> * What about relative pointers e.g. node vs. storage/page?

The pointer stored in the node in memory should uniquely define the
location of the data. It could store an ID of a file and an ID of a node
stored in that file or whatever. E.g. Boost.Interprocess uses offset
pointers containing an offset from the beginning of a memory block owned
by the allocator or memory manager, because next time the memory block
could be placed in memory at a different place.
> * What about memory management issues:
> o Fragmentation
It probably depends on the allocator/storge itself. I'm guessing that in
some cases it wouldn't be a problem, at least with the rtree. If one
file or page or whatever you'd like to call it contained only one node
then an allocator/storage would have to just find a unique name for this
file. So the fragmentation problem would be moved from the program to
the OS/Filesystem.
> o Alloc ahead
Do you have in mind a mechanism to create more persistent memory than is
needed now for the future? This could be used e.g. in the packing
algorithm. I'm guessing that it wouldn't be problematic to add this in
the future.
> o Garbage collection (We are going to do it already using smart
> pointer(s) but on low level?)
> + malloc libs like jemalloc and tcmalloc can do this on low
> level...
GC for what purpose? The rtree explicitly notifies an allocator if it
wants to distroy a node. So it would also notify a storage. So the
storage could store the request for node deletion/page removal and when
flush() was called perform the action together with other actions,
preferably in a transactional manner for safety.
> o Move pages/blocks: Pointers become invalid?
Pointers shouldn't be invalidated. An ID of a node would be stored in
parent node and it could be stored in memory in pointers contained in
the rtree or in the cache of the storage, etc. So I'm guessing that
you're asking about the in-memory part. If the allocator/storage wanted
to move pages then it could do it but e.g. in a way preserving the IDs.
Or would you like to also modify IDs in the pages containing parents?
> o Object/Page/Node pooling???
What do you mean?
> o Does STL have any guidelines restrictions on memory management?
The STL uses the Allocator concept for this, in C++11 stateful
allocators and std::allocator_traits<>. Temporary buffers are created
with std::get_temporary_buffer() or new (nothrow).
But this wouldn't be enough for us since some of the objects (nodes)
would have to be somehow serialized into disk. So I thought about
extending it with
- storage traits (functional) - for notification of the
storage/allocator about some additional actions
- storage traits (definitions) - e.g. for defining what must be
serialized (nodes), and what shouldn't
- serialization interface - implementing how nodes should be saved and
loaded using specific storage

Storage traits could be implemented as one struct as in the case of
std::allocator_traits<> or divided into separate structs as in the case
of Boost.Geometry traits (tag<>, dimension<>, etc.).

> * Is rtree thread safe? Should our Storage/Allocator be thread-safe?

The rtree is thread-safe more or less the same way as STL containers
are. Mutable actions aren't thread-safe (e.g. creation, insert, remove),
non-mutable should be (e.g. query).

So it should probably be possible to read data safely. If a
storage/allocator was fully thread-safe it could allow us in the future
to e.g. parallelize the persistent rtree creation. But let's keep it
simple for now :)

>>> 2. Is there any way an rtree can be read-only? Or set read-only
>>> after (bulk)insert?
>> It could be done either in run- or in compile-time.
>> The most obvious way would be to throw a run-time exception in the
>> get_pointer() after some flag (rtree created) was set.
>> Another possibility could be to disable rtree mutable methods like
>> insert() and remove() in compile-time if some specific parameters or
>> Storage/Allocator was passed.
>> Do you have some specific use-case in mind?
> At the company I'm working for :-) They use readonly maps a lot. So in
> the future(hopefully)rtree with storage/allocator as cache.

Ok, as I said, it could be done. This mechanism could be implemented in
the allocator itself, or the behavior could be defined in compile-time
in storage traits, etc.


>>> Other questions/points that come to my mind are:
>>> * Must the storage allocator always store? What about transactions
>>> between in memory (RTree) and Persistence storage?
>>> * State-full allocator: Who owns the nodes and the data inside it?
>>> I always thought of an allocator like a factory that does not
>>> own the data...
>>> * When data is owned by another STL(like) container then IMHO a
>>> storage allocator has no use. This because the storage allocator
>>> then does not own the data and has no means to free/unload it.
>> I see 2 ways how the ownership could be implemented:
>> 1. Similar to Boost.Interprocess where there is a memory Manager
>> owning the data and AFAIU an Allocator only wrapping some reference
>> to this Manager. So AFAIU (I didn't check the code) the Allocator is
>> only an interface between a Container and a Manager. This simplifies
>> the copying of an Allocator, rebinding, etc.
>> 2. An Allocator which is in the same time a Manager or rather owns a
>> Manager, so also owns the data. This'd probably require that the
>> Allocator would have to store a shared_pointer to some Manager
>> internally. An instance of a Manager would be created under the hood
>> and a pointer to it would be copied automatically.
>> 1 is more clear but it'd require to design not only the interface for
>> a Storage/Allocator but also for a Manager, like in the case of
>> Interprocess. 2 would only require to design the interface of an
>> Allocator/Storage. If we choose 1 we could divide the logic into 2
>> Concepts, the Allocator could implement the in-memory-part (pointers,
>> memory allocation, construction, destruction, etc.) and the Manager
>> the persistent part (some file IDs or handles, saving and loading
>> files, nodes serialization, etc.). So as in the case of Interprocess
>> we could have 1 Allocator and could have many Managers.
>> In general I propose to work iteratively, that is at the beginning
>> support minimal set of required operations in the Storage/Allocator
>> to be able to e.g. insert() values and perform a query() in the most
>> simple, probably inefficient way, but with a simple and elegant
>> interface. And then optimize operations and extend the interface. Or
>> would you prefer to design the whole Storage/Allocator theoreticaly?
> I prefer both :-) I will try to keep an UML overview with our
> suggestions to keep an theoretical overview. Next to that its
> probably already time to dive into code to see what is usable and what
> not.
> Pratical question: Should I fork BoostGeometry on github or do you
> prefer a branch?

You should create your fork of Boost.Geometry on GitHub and create a
branch originating in develop in your fork.
Here is a tutorial:

At Boost we're using GitFlow branching model so your branch should
originate in develop and be called feature/xxx, e.g. feature/persistency
or feature/persistent_allocator, or something like that. So you'd be
able to add your code in this branch and if something on the rtree side
was needed I'd add it to develop (or you could create a PullRequest) and
you'd be able to synchronize your branch (you could just add it in your
branch). At the end you'd prepare a PullRequest against the develop
branch. Then we could review your code and merge it. Or you could create
a few smaller PRs in the process.


Geometry list run by mateusz at