Hi Henry,
Henry Roeland wrote:
On 26 nov. 2014, at
17:48, Adam Wulkiewicz <adam.wulkiewicz@gmail.com>
wrote:
Henry Roeland wrote:
Dear all,
First let me introduce myself: my name is Henry
Roeland 37 years old and currently working as
software engineer for a company that delivers maps
to car companies like BMW and VW.
I'm working in the testtools team that is
responsible for a map viewer and maptest batch tool.
Technics that we use are C/C++, Sqlite, spatialite
and of corse boost.
Personally I have multiple years of C++ experience
and busy getting up to speed with geometry technics
in 2D. NeXT to that just started with studying
advanced datastructures and algoritms like b+tree
and r*tree.
Currently I see your rtree as a perfect candidate
for in memory container/index/cache in our map
interface. As I wrote in my previous email the
feature that is not supported yet is paging. How can
I help to accomplish this feature? Are there specs?
Or am I the only one in favior of this feature?
Certainly not the only one. People are asking about it
from time to time, using Interprocess' mapped file as
a replacement. You're the only one who is willing to
help with the implementation. I greately appreciate
it.
As I mentioned before I was thinking about a stateful
allocator/storage/manager handling file or
files/pages, cacheing etc. This allocator would return
pointers which when dereferenced would force the
allocator to load and cache data. Then, there would be
a mechanism to tell the allocator that some
pointers/data aren't used anymore. It could be done
explicitly by calling some function on allocator like:
persistent_allocator alloc(...);
rtree<..., persistent_allocator> rt(..., alloc);
rt.insert(...);
rt.insert(...);
rt.insert(...);
alloc.flush();
I don’t know about an rtree but from my homework:-)
concerning b-trees I understood that pages can be
handled as nodes themselves. Pages can be split and
contain multiple nodes depending on size. If we could
create a stable interface for Page(Nodes) then this
interface can become public leaving the Node interface
hidden???
It can also have other advantages to alloc a group of
nodes (Page) together in terms of memory alignment(Your
don’t need a smart pointer for every node only for the
page?), Also in terms of (de)serialisation and bulk load
handling a group of nodes would IMHO be better (if its
possible).
Of course you're right, nodes could (or even should) be
grouped in pages, loaded together etc. Also we should
probably group close nodes together in one page (e.g.
parents with children), allocator hint could be used for
this. But I was thinking about leaving the desicion how to
store nodes to the allocator/storage. As you said one
could use big pages, one could store 1 node per file, or a
database etc. And still somehow a page or some other
storage or Archive must know what data should be
saved/loaded per node. In my scenario this would just be
implemented in a Serialization-like way. Higher level
aspects like pages would be closed in allocator/storage.
Of course assuming that this could be done this way but I
feel that it could.
I think that a pointer still would have to store some ID
of a node inside a page to identify an exact placement of
the data. Or at least some global ID somehow indexed to
access a specific node in a specific page.
The next step could be
adding of some hooks or traits for allocators which
would be called by the rtree internally on the end of
some operation, e.g. insert. So this wouldn't be a C++
allocator anymore since it'd have additional methods.
To avoid storing a reference to allocator/storage in
those smart pointers the rtree could notify the
allocator/storage when a pointer is dereferenced and
data is required. So the responsibility for the
notification of allocator/storage would be moved from
a pointer to the rtree.
The Page Node can administrate its size and its free
size and have a smart_pointer telling when it was last
used and dependency_checker /unload checker with
responsibility for notification. Maybe we can use Boost
Signals2 for notification?
Together with a Page/Storage manager I’m thinking about
the following points:
- Policy for when to load/unload. traits as you point it
out.
- Size of rtree must be known at all times. If some
limit is reached start unloading (allocator can tell the
size?) to bottom pages that are less used?
- 25%/50%/75% percent of the top of the tree must remain
in memory for speed???
- define 3 actions/states for a page: loaded, unloaded,
unloaded but cached as memory map.????
So what you're describing here is how a paging
allocator/storage could work. Sure if the storage was
notified by the rtree about the performed actions it'd be
able to do whatever it needs, cache pages, store some kind
of usage counter to know which pages are used most often,
support various policies, etc.
I was describing more low level stuff that would have to
be implemented on the rtree side. Basically 2 things
notifications for actions and an interface for
serialization of nodes and rtree header. Or would you
prefer if I handled it, that you could focus on the
implementation of a specific allocator/storage?
There is also another thing that is mentioned in another
thread (started by Tim). There should be some way of
creation non-rtree, temporary pages, e.g. for data stored
in a vector or some other container designed specifically
for this purpose. But this could be handled later.
Finally the rtree could
notify the allocator/storage each time the data owned
by a pointer is no longer needed but this probably
wouldn't change anything. Furthermore modifying and
releasing the nodes before an operation is finished
wouldn't be safe in case when an exception was thrown.
In fact, this allocator should keep a copy of nodes in
memory during a whole operation and then at the end of
the operation save the data into a persistent storage.
The next thing is how to serialize data from and to
node. We should ask ourselves should we allow the
users to implement other persistent storage variants
on their own. If the answer was no, then we could just
internally in the persistent allocator use the
internals of the rtree. But if we wanted to allow it
(which IMO should be done) the most straightforward
way would be to expose the interface used by the rtree
internally to access node's elements. Exposing it
would mean that we couldn't change it in the future,
so I'd prefer to avoid it. Instead, we could mimic or
directly use Boost.Serialization interface for Nodes.
In this scenario a specific Node would know what data
must be saved and loaded and it'd be able to save or
load to/from some generic Archive. Depending on the
needs this could be some temporary 1-node Archive
gathering data for later or a target Archive capable
to serialize data on the fly, it'd depend on an
allocator/storage.
This way we could also support versioning on a node
level, the same way how it's done in Serialization. So
changes would have to be done on a nodes level not in
the user-defined allocator. An example could be an
implementation of a rtree variant storing additional
data in nodes (hilbert r-tree) or additional types of
nodes (PRtree). Also arbitrary user-defined Values
would be serialized the same way (using Serialization
or familiar Serialization-like interface).
And this way we'd also support Serialization in one
go.
We'd probably need some file header with the
parameters of an rtree both in persistent storage and
in serialization support. Similar as with nodes, an
rtree could know how to load/save the header from/to
some Archive. The rtree should e.g. check if it's
possible to use persistent storage and load data, e.g.
if it wasn't created using incompatible parameters,
etc.
How is it with backwards compatibility? I can imagine
that Storage work can break something already running?
I think that this won't be problematic.
If I’m talking rubbish
just tell me I can handle it:-)
All ideas are valuable :)
Thanks for beneath info
to getting me started. I will have to read it a few
times to understand it all.
If you have any questions just ask. If I handled it on the
rtree side and you was working on the allocator/storage,
you probably wouldn't even be forced to know all of this
internal stuff. Not that I have something against it :)
Regards,
Adam