Subject: [boost] [heap][formal review]
From: Thomas Klimpel (Thomas.Klimpel_at_[hidden])
Date: 2011-06-13 09:02:38
this is my actual review of Boost.Heap.
I took part in this review, because I was interested in the results of Boost.Heap for my old heap test/benchmark (related to the fast marching algorithm). So I decided that trying to use Boost.Heap with the old test code was a good experiment for a review, allowing me to see how helpful the documentation is, how easy it is to use the heaps and how well they perform.
- The documentation is easy to read and helpful with respect to the questions it tries to address. The main question it doesn't try to address is how to make effective use of Boost.Heap, i.e. which heap to use for which application and how to use it optimally.
- Boost.Heap is easy to use, at least in principle.
-- One inconvenience for me was the use of max-heaps instead of min-heaps. Well, one could argue that a priority_queue should return the elements with the highest priority first (but as Wikipedia says: "some conventions consider lower priorities to be higher"), and heap-sort uses a max-heap. Since these are the two application where stl uses a heap, it's clear why stl implements a max-heap. But the applications where it is beneficial to call increase or decrease directly normally use a min-heap. Now I have to call increase even so the key value actually has decreased...
-- I was left a bit with the feeling that I hadn't made the most efficient use of Boost.Heap. For example, I quickly found out that calling update just with a handle (as I had implemented it first) was suboptimal for fibonacci_heap.
-- The segmentation fault in pairing_heap was unexpected for me. Even more unexpected was Tim's response that it's actually expected. So I tried to inform me a bit. I found statements like "A pairing heaps is a type of heap data structure with relatively simple implementation and excellent practical amortized performance" and "Experimental studies indicate that pairing heaps actually outperform Fibonacci heaps", which are incompatible with pairing_heap segfaulting for a simple testcase (IMHO). My conclusion is that the statement "due to the recursive nature of this data structure, ... beware of stack overflows" is actually misleading. This implementation of pairing_heap actually has a known issue. As long as this issue is not fixed, the structure may be interesting for tests, but "production" use should be strongly discouraged.
- The performance numbers for Boost.Heap were neither bad nor outstanding.
-- The numbers for b_heap are not convincing
-- The numbers for fibonacci_heap seem to be the best.
-- The pairing_heap has a smaller compare count than fibonacci_heap, but actually performs worse than fibonacci_heap in this test. Considering statements like "Experimental studies indicate that pairing heaps actually outperform Fibonacci heaps", I wonder whether the implementation of pairing_heap could be improved so that it actually outperforms fibonacci_heap.
-- The comparison with TStaticHeap and TSequentialHeap should be taken with a grain of salt. I had first just added fibonacci_heap to the test, which is a node based heap. The test case isn't very favorable for node based heaps. And for the other heaps, I used the same class template, so that .reserve isn't called (it doesn't exist for node based heaps), but the analog function is called for TStaticHeap. And the tuned TSequentialHeap uses "debatable"/"unusual" techniques.
> What is your evaluation of the design?
I guess it's OK. What makes me a bit nervous is the statement "The library is not targeted in any specific domain nor directly related to any specific libraries". A good interface often is the result from an abstraction over different applications. The other way (coming from the data structure/algorithm) risks bloating the interface with unnecessary functionality. For example, somebody asked during the review for "iteration over the heap in priority order". Is this really something that many applications requiring a heap will benefit from? I heavily doubt this.
> What is your evaluation of the implementation?
I haven't looked too deep into it, but I think it's OK. I'm not sure whether there isn't a bit too much code reuse and "one size fits all". Was it really possible to implement all the heaps in their optimal form with this approach? But I'm not expert enough to really judge this.
I was a bit surprised how pairing_heap::merge_children was implemented (was it just a mistake?), but I don't believe that this is caused by the code reuse or the "one size fits all". It's possible that similar problem also exist in other functions, but that doesn't change the good overall impression. It's only natural that such problems are only brought up when the code is used by more people than just the initial author.
> What is your evaluation of the documentation?
The documentation is easy to read. It doesn't address the question how to make effective use of Boost.Heap, i.e. which heap to use for which application and how to use it optimally.
One benefit of the library would be to educate the user about the different possibilities. An important thing here is worse case performance and amortized performance, why are different heaps needed at all, ...
> What is your evaluation of the potential usefulness of the library?
There are different use cases I can see:
1) You already have a heap in your software, but you want to see how good it performs with respect to what else is out there.
2) You need a heap as part of a more complex algorithm.
3) You need a priority queue for scheduling something.
The library in its current form is actually really well suited for 1, because all the heaps have a very similar interface, which makes it easy to test many different heaps without much effort.
The library is also quite useful for 2, even if it would be nice if the documentation would save me some of the required experimentation. I also fear that the uniform interface of the heaps sacrifices some optimality for some algorithms.
- Take the fast marching algorithm as an example. I typically have one bulk with the arrival times and one bulk with the speeds. I'm not especially keen on having another bulk with "heap_t::handle_type", as that may double my memory consumption on a 64-bit system, if arrival times and speeds were of type float.
- However, the alternative is giving up the uniform interface of the heaps, just for some very questionable savings for some special applications. Because for Dijkstra's algorithm, the memory consumption is dominated by the edges of the graph, so that I don't care about an additional array with "heap_t::handle_type" of the same size as the number of vertices of the graph.
I have the impression that std::priority_queue is already sufficient for 3, so that Boost.Heap wouldn't be required for that use case.
Overall, I think it is quite useful.
> Did you try to use the library? With what compiler?
> Did you have any problems?
Yes, I used it. gcc-4.3.4 (cygwin), gcc-4.3.3 (win32), gcc-4.4.3 (ubuntu), gcc-4.5.2 (win32), MSVC-9, MSVC-10. There were some minor problems, but all the problems I actually reported have been fixed (except the unexpected segfault for pairing_heap). So may main problem was probably that I felt a bit left alone with respect to the question which heap to use and how to use it optimally.
> How much effort did you put into your evaluation?
> A glance? A quick reading? In-depth study?
More effort than I had actually planed to spend, but I already knew from previous reviews that this would happen. I think I did a reasonably deep study from a user point of view.
> Are you knowledgeable about the problem domain?
Yes, at least from the user point of view. I have heard of binomial heaps, d_ary heaps and fibonacci heaps before. I didn't knew about b_heap, pairing_heap and skew_heap, but I knew that there are many different heap variants out there. For example I knew about the sequential heap described in "Fast Priority Queues for Cached Memory" by P. Sanders (which doesn't mean that TSequentialHeap is identical to the sequential heap described in that paper, but it isn't too different either).
> Do you think the library should be accepted as a Boost library?
No, the library should not be accepted unless
- The current review either magically goes back on track and at least 3 other persons also submit a proper review, or another (mini?) review with proper participation is held later.
- Either pairing_heap is fixed, or non-experimental (="production") use is strongly discouraged in the documentation.
- Either a convincing example where b_heap outperforms the other heaps is presented, or it's use should be discouraged in the documentation.
- The documentation says a bit more about the different heaps and their trade offs (amortized time <-> worst case), and generally tries to give recommendation to the user which heap will likely best fit his application, and which settings he should try (for example something like that arity<4> is a good compromise for a d_ary heap between cache locality and compare count).
I used the negative form of "Yes, the library should be accepted under the following conditions", because I'm unhappy with how the review went. The announcement was sent out too late, without stating the dates of the review, much of the initial discussion was about the code collaborator tool, but I haven't seen any "real" review, despite that fact that there seems to be interest in the library. There were some discussions on Code Collaborator, and the tool was actually not bad in pointing to concrete problems in the actual code. Tim quickly addressed these, but a good review would have some discussions and actual opinions and suggestions. It is also unclear to me how well Boost.Heap performs with respect to boost/graph/detail/d_ary_heap.hpp. I could have tried to also include this heap in my heap test/benchmark, but as the review seemed to have silently died already, I wasn't motivated enough for doing this.