Boost logo

Boost :

Subject: Re: [boost] [stl_ext_adv] ] Interest in advanced data structures
From: Vadim Stadnik (vadimstdk_at_[hidden])
Date: 2011-12-08 03:31:56


Hi all,

This message is a detailed reply to the analysis and comments by Joaquín M
López Muñoz.

The most challenging request, which required quite significant work, is
concerning memory consumption and performance of augmented B+ trees. This
aspect is quite broad. It is covered in two new sections of the updated
documentations:

1) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html , the
section “Space requirement“ ;

2) http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/performance.html , the
section “Effect of augmenting on performance of B+ trees” ;

The last document with performance measurements has been modified quite
considerably. It now includes tests of all variants of proposed B+ trees
and boost::multi_index_container as multiset.

Please have a look at the updated documentation and my replies in the text
below and let me know if I missed anything significant.

On Sat, Dec 3, 2011 at 11:25 PM, Joaquin M Lopez Munoz <joaquin_at_[hidden]>wrote:

> Vadim Stadnik <vadimstdk <at> gmail.com> writes:
>
> >
> > Hi all,
> >
> > Advanced data structures (ADS) offer useful STL extensions that are not
> > possible or not efficient with basic data structures, such as arrays,
> > linked lists and red-black trees.
> >
> > The project
> >
> > https://github.com/vstadnik/stl_ext_adv
> >
> > http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/index.html
> >
> > focuses mainly on a generalized and unified computer representation of
> > sequences using the augmented B+ trees. It provides the following STL
> > extensions [...]
>
> Hi,
>
> This is a rough study of the functionality provided by your lib in
> general and as compared with Boost.MultiIndex.
>
> * Container_bptree's internal data structure relies on B+
> trees with two variants based on the "augmented data structure"
> idea explained in Cormen, Leiserson, Rivest and Stein's "Algorithms"
> (an online explanation can be found at http://tinyurl.com/d8lze7v ,
> for instance.) Briefly put, such augmenting scheme boils down
> to adding additional information to the nodes updated on insertion
> time and erasure time which is based on the additional info of the
> sibling nodes. The variants are:
> - bp_tree_idx: augmented to allow for efficient indexing
> - bp_tree_idx_acc: doubly augmented to support efficient
> indexing plus a generalized "accumulate" notion usable for
> calculating summatories, mean values, etc.
>
> * Question #1: why augmenting on B+ trees? The augmenting idea
> can be as easily built on top of RB-trees, which are
> the weapon of choice for implementing associative containers.
> Unless proved otherwise, RB-trees are assumed to be faster than B+
> trees for in-memory structures (which your own tests hint at, cf.
> "insert N random elements" table.) You say:
>
> "The red black trees are yet out of the scope of this project
> for the reasons explained in the documentation."
>
> I've failed to find such explanation. Also, I suspect memory
> consumption is higher with B+ trees than RB trees.
> - Request #1: explain why B+ trees rather than RB trees.
>

That’s my fault, I should have added to the original message reference to
this document:

http://dl.dropbox.com/u/51041088/stl_ext_adv_doc/doc/bp_trees.html
The section “Advantages of B+ trees“ briefly covers the topic.

 - Request #2: provide code for performance tests so that
> Boost.MultiIndex and oher data structures can be compared.
>

The newly added file “perform_tests.hpp” contains the code of test
algorithms. My tests and timer are Windows specific, I am planning to
replace the timer with one of Boost timers and update this code. I have
added performance measurements for Boost multi_index_container as mutiset
to the document mentioned at the top of this message.

> - Request #3: augment the performance tests so as to yield
> memory consumption figures.
>
>
This request is covered in two new sections of the updated documentation as
explained at the beginning of this message.

> * Usage scenario 1: Plain B+ trees as a replacement for RB tress
> for associative containers. My hunch is that RB-trees would
> perform better, as explained above.
>

I agree absolutely,
Replacing red black trees with the augmented B+ trees is only justified if
it is beneficial for a user algorithm, such as using a new extended
functionality or a new efficient algorithm.

>
> * Usage scenario 2: sequence-like containers with efficient
> insertion/erasure. Directly based on bp_tree_idx, with a
> do-nothing sorting critera. These containers are not truly
> random-access, but rather provide O(log n) access, which can
> be acceptable (or not) for some applications.
>
>
This is correct, one operation supporting random access iterators has
logarithmic cost in the worst case. The new sequences are a good choice
when a user algorithm requires efficient insert, erase operations, which
outperform std::vector as shown in performance tests. They can also replace
inefficient bidirectional iterators of std::list.

> * Usage scenario 3: associative container with random access
> iterators. This roughly compares with a multi_index_container
> equipped with two indices, one ordered, the other random-access.
> Basic differences between both:
> - The multi_index_container provides truly O(1) random access
> whereas Container_bptree is O(logN) (which is not random access
> strictly speaking.)
> - With the multi_index_container, sorting order and insertion
> order are separate --some sort of synchronization between both
> indices should be needed to mimic the Continer_bptree case.
> - Insertion/erasure with the multi_index_container is linear
> (though much faster than with std::vector), logarithmic with
> Container_bptree.
> - The memory overhead introduced by the multi_index_container
> is two pointers vs. only one word for Container_bptree.
>
>
The Boost Multi-index library offers a very rich functionality. My
knowledge of this library is not yet sufficient to make specific
suggestions. Nevertheless, I think that this library can benefit from the
proposed B+ trees and containers as well as from the method of augmenting
applied to other data structures. I will be happy to help in this area.

> * Usage scenario 4: containers with fast accumulate, a direct
> application of the augmenting idea explained above. There is
> nothing in Boost that provides this functionality.
>

Thank you for confirming this, since in one of previous messages there was
information that logarithmic efficiency of the standard function
accumulate() is already supported by Boost library.

>
> My conclusions:
>
> 1. Augmented data structures (of which efficient indexing
> is just a particular application) are definitely useful and
> worth having in Boost.
> 2. Unless proved otherwise, we should be using RB trees
> (or maybe AVL) rather than B+ trees.
> 3. Depending on the particular needs, some usage scenarios
> can be better served with Boost.MultiIndex (note I'm
> saying "depending on needs", since differences between
> approaches are substantial.)
> 4 To me, rather than providing out of the box containers with
> augmenting we should provide this via two libraries:
> - As an additional capability of Boost.Intrusive associative
> containers (and AVL-tree based containers as well, if
> technically possible.)
> - As an additional capability of Boost.MultiIndex ordered
> indices.
>
> Hope this helps,
>
> Joaquín M López Muñoz
> Telefónica Digital
>
>
Thank you for your analysis and conclusions.

I am happy for the Boost experts to decide how to use the proposed data
structures and containers. To start work I need a list of tasks and/or
requests with priorities.

Regards,

Vadim


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk