|
Boost : |
Subject: Re: [boost] Interest in B-tree library for Boost?
From: Beman Dawes (bdawes_at_[hidden])
Date: 2010-09-15 15:10:06
On Wed, Sep 15, 2010 at 12:44 PM, Simonson, Lucanus J
<lucanus.j.simonson_at_[hidden]> wrote:
> Beman Dawes wrote:
>> A Boost B-tree library would provide disk-based associative containers
>> that scale all the way from really, really, small to really, really,
>> large. B-trees perform well on hardware ranging from ancient floppy
>> disk drives all the way up to humongous disk arrays. They are the
>> technology behind most high-performance disk file systems and
>> databases.
> ...
>> Initial timing tests have been very encouraging. The time test
>> compares the btree_map implementation to std::map. As expected, a
>> btree_map is many times slower than a std::map, given a cold operating
>> system file cache and constraining the btree_map's caching to a
>> minimum. But at the other extreme, when the O/S file cache is warm and
>> the btree_map's cache is as large as the file, the btree_map is very
>> similar in speed to std::map.The btree library's code is still very
>> new, so timings and everything else may be subject to lots of
>> revision.
>
> Was this post prompted by the Judy tree question raised yesterday?
No, I haven't been reading the list recently. I started work on the
library in 2000, but became convinced that the C++ language needed
changes to really support B-tree's well. That was the motivation for
my POD proposals, now part of C++0x.
I tried again in 2006, but didn't separate binary file, buffer
management, and core B-tree needs sufficiently, so abandoned that
work.
Then in May of this year, I started again, and have been much happier
with the results.
> Is the code in the sandbox?
It is available as a .zip file from
http://mysite.verizon.net/beman/btree/btree-preview-1.zip, and as a
git repository from http://github.com/Beman/Boost-Btree
> Now that we have broken free of the 32bit addressable limit for commodity hardware there is less need for self paging data structures than there was a few years ago.
That may be true, but it has little to do with many of the uses of
B-trees. Even when you have enough memory to read an entire container
into memory, why would you want to do that if you just want to access
a tiny subset in any given time interval? Preloading is useful, and an
option in the proposed library, when there is a lot of memory
available and you may need access a high proportion of the entries.
But it shouldn't be a requirement as that limits use to environments
where the amount of memory is large in relationship to the needs of
the application. One of the nice characteristics of a B-tree is that
they work very well in constrained memory situation, but then can take
advantage of lots of memory when available.
>Unless there is a way to configure it that gives a performance advantage over std::map...
If the data lives on disk, a B-tree will have much better performance
than loading an entire std::map into memory, and then performing
sparse operations on it.
>
> I looked into the details a little and I found it funny that you followed the advice I gave on the geometry list a couple days ago in the discussion of the design of a "generic" tree:
>
> "Node could even be declared as a private sub-class of tree to keep people's hands off of it and make it easier to implement by allowing it to share the template parameters of tree."
The implementation of nodes is a messy detail. No the sort of
low-level stuff users should have access to.
> I notice you made your private branch and leaf types classes instead of structs. Since they are tightly coupled to the tree by virtue of being private member classes I'm not sure they need data hiding to protect their members from the tree. I would have made them structs. You could be right, however, that protecting oneself from oneself is worthwhile. Certainly there is rarely anyone else writing bugs and violating design intent in the details of most libraries.
It will get even messier when there is an alternative note
representation for variable length data. So hiding the details, even
from other parts of the implementation will be a necessity.
Cheers,
--Beman
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk