On 1 dec. 2014, at 17:20, Adam Wulkiewicz <adam.wulkiewicz@gmail.com> wrote:
Henry Roeland wrote:
It would be greet if I can focus on
storage allocator and leave the rtree part to you or
somebody else.
Ok, let's do it this way. FYI, I won't have time this
week to work on this. But at least I'll answer on this
email briefly.
No problem.
Ok, we could get back to this if you wanted.
I hope I can still ask you guys
questions concerning this storage allocator? I have
some global Idea’s and probably need help both on
design level and implementation.
Questions so far:
1. Is paging the only mechanism to get
memory growth under control for an rtree? Its the
only one I can think of, but maybe there are other
ways to explore?
For the first version I think that our design should be
as generic as possible:
One vote for generic! :-)
- the Storage concept should
be an abstraction of a persistent storage, just like an
Allocator is an abstraction of "directly-accessible"
memory.
- the rtree would request something or notify the
storage about some events but it wouldn't have to know
anything about the internal structure, techniques,
heuristics, etc. E.g. the rtree wouldn't know anything
about paging.
OK. But what about the other way around? Should the
Storage/Allocator not know how the container is build up?
This in order to choose efficiently how to store its data?
What things should be known by the allocator/storage? I'm guessing
that if at some point some additional info was needed then it would
be easy to add another function or definition to the storage_traits,
so the storage would know everything it needs.
- the interfaces should
probably be as compact as possible, e.g. as I said
before, I think that in addition to the Allocator's
interface we'd need 2 functions:
*
storage_traits<StorageOrAllocator>::get_pointer(ptr)
- notify the Storage that data pointed by a pointer is
needed
*
storage_traits<StorageOrAllocator>::release_pointers()
- notify the Storage that pointers aren't used anymore
- the storage probably shouldn't write the data directly
to persistent part, instead it should keep the modified
data cached and then write them all in one go. It could
depend on some heuristic, the Storage could write
changes when release_pointers() was called or it could
define flush() method which could be called explicitly
by the user. Still, the changes of persistent data
should be done only if all pointers was released. At all
times the storage could choose to keep the data cached.
It could choose to unload all of the nodes, or only some
less frequently used ones.
Is it reasonable?
Yes it sounds reasonable. But are we not creating a
generic storage allocator which also can be used on e.g.
std::vector or other stl containers? I have no problem with
this but I cannot imagine that we are the first…
To be honest, I haven't done an extensive research regarding the
existing solutions. However every solutions I saw are tightly bound
with the container itself. I think that in most cases the container
that is implemented is either in-memory or persistent. But the fact
that I haven't seen a persistent storage abstraction doesn't mean
that it doesn't exist. Though I'm not sure if it would be reasonable
in all cases. So sure, additional research could be done.
For the rtree we only need a storage capable to allocate the memory
for nodes. And we also need a way to create a big temporary storage,
probably divided into several files.
Though it would probably be possible to implement a persistent
allocator for std::vector it wouldn't be very efficient because a
vector reallocates the memory by creating another block and copying
all of the elements. Handling this problem could allow us to
implement a heap-like kd-tree storing the nodes in a contigeous
storage. But then we'd be forced to deal with additional problems
like cacheing chunks of this big persistent memory block,
prefetching, etc.
Other questions/points that popup:
- What about relative pointers e.g. node vs.
storage/page?
The pointer stored in the node in memory should uniquely define the
location of the data. It could store an ID of a file and an ID of a
node stored in that file or whatever. E.g. Boost.Interprocess uses
offset pointers containing an offset from the beginning of a memory
block owned by the allocator or memory manager, because next time
the memory block could be placed in memory at a different place.
- What about memory management issues:
It probably depends on the allocator/storge itself. I'm guessing
that in some cases it wouldn't be a problem, at least with the
rtree. If one file or page or whatever you'd like to call it
contained only one node then an allocator/storage would have to just
find a unique name for this file. So the fragmentation problem would
be moved from the program to the OS/Filesystem.
Do you have in mind a mechanism to create more persistent memory
than is needed now for the future? This could be used e.g. in the
packing algorithm. I'm guessing that it wouldn't be problematic to
add this in the future.
- Garbage collection (We are going to do it
already using smart pointer(s) but on low level?)
- malloc libs like jemalloc and tcmalloc
can do this on low level…
GC for what purpose? The rtree explicitly notifies an allocator if
it wants to distroy a node. So it would also notify a storage. So
the storage could store the request for node deletion/page removal
and when flush() was called perform the action together with other
actions, preferably in a transactional manner for safety.
- Move pages/blocks: Pointers become invalid?
Pointers shouldn't be invalidated. An ID of a node would be stored
in parent node and it could be stored in memory in pointers
contained in the rtree or in the cache of the storage, etc. So I'm
guessing that you're asking about the in-memory part. If the
allocator/storage wanted to move pages then it could do it but e.g.
in a way preserving the IDs. Or would you like to also modify IDs in
the pages containing parents?
- Object/Page/Node pooling???
What do you mean?
- Does STL have any guidelines restrictions
on memory management?
The STL uses the Allocator concept for this, in C++11 stateful
allocators and std::allocator_traits<>. Temporary buffers are
created with std::get_temporary_buffer() or new (nothrow).
But this wouldn't be enough for us since some of the objects (nodes)
would have to be somehow serialized into disk. So I thought about
extending it with
- storage traits (functional) - for notification of the
storage/allocator about some additional actions
- storage traits (definitions) - e.g. for defining what must be
serialized (nodes), and what shouldn't
- serialization interface - implementing how nodes should be saved
and loaded using specific storage
Storage traits could be implemented as one struct as in the case of
std::allocator_traits<> or divided into separate structs as in
the case of Boost.Geometry traits (tag<>, dimension<>,
etc.).
- Is rtree thread safe? Should our
Storage/Allocator be thread-safe?
The rtree is thread-safe more or less the same way as STL containers
are. Mutable actions aren't thread-safe (e.g. creation, insert,
remove), non-mutable should be (e.g. query).
So it should probably be possible to read data safely. If a
storage/allocator was fully thread-safe it could allow us in the
future to e.g. parallelize the persistent rtree creation. But let's
keep it simple for now :)
2. Is there any way an rtree can be
read-only? Or set read-only after (bulk)insert?
It could be done either in run- or in compile-time.
The most obvious way would be to throw a run-time
exception in the get_pointer() after some flag (rtree
created) was set.
Another possibility could be to disable rtree mutable
methods like insert() and remove() in compile-time if
some specific parameters or Storage/Allocator was
passed.
Do you have some specific use-case in mind?
At the company I’m working for :-) They use readonly maps
a lot. So in the future(hopefully)rtree with
storage/allocator as cache.
Ok, as I said, it could be done. This mechanism could be implemented
in the allocator itself, or the behavior could be defined in
compile-time in storage traits, etc.
<snip>
Other questions/points that come to my
mind are:
- Must the storage allocator always
store? What about transactions between in memory
(RTree) and Persistence storage?
- State-full allocator: Who owns the
nodes and the data inside it? I always thought
of an allocator like a factory that does not own
the data…
- When data is owned by another
STL(like) container then IMHO a storage
allocator has no use. This because the storage
allocator then does not own the data and has no
means to free/unload it.
I see 2 ways how the ownership could be implemented:
1. Similar to Boost.Interprocess where there is a memory
Manager owning the data and AFAIU an Allocator only
wrapping some reference to this Manager. So AFAIU (I
didn't check the code) the Allocator is only an
interface between a Container and a Manager. This
simplifies the copying of an Allocator, rebinding, etc.
2. An Allocator which is in the same time a Manager or
rather owns a Manager, so also owns the data. This'd
probably require that the Allocator would have to store
a shared_pointer to some Manager internally. An instance
of a Manager would be created under the hood and a
pointer to it would be copied automatically.
1 is more clear but it'd require to design not only the
interface for a Storage/Allocator but also for a
Manager, like in the case of Interprocess. 2 would only
require to design the interface of an Allocator/Storage.
If we choose 1 we could divide the logic into 2
Concepts, the Allocator could implement the
in-memory-part (pointers, memory allocation,
construction, destruction, etc.) and the Manager the
persistent part (some file IDs or handles, saving and
loading files, nodes serialization, etc.). So as in the
case of Interprocess we could have 1 Allocator and could
have many Managers.
In general I propose to work iteratively, that is at the
beginning support minimal set of required operations in
the Storage/Allocator to be able to e.g. insert() values
and perform a query() in the most simple, probably
inefficient way, but with a simple and elegant
interface. And then optimize operations and extend the
interface. Or would you prefer to design the whole
Storage/Allocator theoreticaly?
I prefer both :-) I will try to keep an UML overview with
our suggestions to keep an theoretical overview. Next to
that its probably already time to dive into code to see what
is usable and what not.
Pratical question: Should I fork BoostGeometry on github
or do you prefer a branch?
You should create your fork of Boost.Geometry on GitHub and create a
branch originating in develop in your fork.
Here is a tutorial:
https://github.com/boostorg/geometry/wiki/Contribution-Tutorial
At Boost we're using GitFlow branching model so your branch should
originate in develop and be called feature/xxx, e.g.
feature/persistency or feature/persistent_allocator, or something
like that. So you'd be able to add your code in this branch and if
something on the rtree side was needed I'd add it to develop (or you
could create a PullRequest) and you'd be able to synchronize your
branch (you could just add it in your branch). At the end you'd
prepare a PullRequest against the develop branch. Then we could
review your code and merge it. Or you could create a few smaller PRs
in the process.
Regards,
Adam