Subject: [boost] [Review:Algorithms] My review of Boost.Algorithm
Date: 2011-09-30 18:02:23
First of all I have to clarify that I have only studied the three search algorithms and I can only speak about this part of the library. Furthermore my review was by no means exhaustive, I have only emphasized in few things which I had the time to notice.
However I have noticed at least two showstoppers, one in documentation of the search algorithms and the other in the design of the KMP algorithm. Hence my vote for now is "no". I personally believe that the search algorithms should not be accepted, unless these issues are addressed, (or unless I am proved wrong of course) although it would be undoubtedly a good idea to have such a library in Boost. Specifically in the design issue of the KMP algorithm I am willing to help if this becomes necessary.
A. Documentation issue
The three search algorithms of this library does not offer any additional functionality compared to the standard std::search of the STL. On the contrary these algorithms are less generic than std::search. Consequently the only single reason that will make a developer to choose one of them instead of the std::search can be their performance superiority. (In the most notable STL implementations std::search is implemented using straightforward (brute-force) searching.)
However in the documentation I have not found anything (no benchmarks, measurements, etc) that would help the user of the library to understand how fast each algorithm is and in which situations it is advisable to use it. (I believe Phil Endecott has mentioned this issue as well) This is especially important because of the following reasons:
- Boyer-Moore and Boyer-Moore-Horspool have identical average time complexity, hence it is not easy for the user to choose among them based on their complexities.
- Although KMP has much better worst-case complexity than a typical std::search implementation, their average complexities are practically the same.
- The straightforward (brute-force) sequence searching is known to frequently outperform some KMP implementations.
For all these reasons I believe that some performance analysis and performance guidelines should be added to the documentation, before this library becomes a part of Boost. For example it would be very nice to have a graphical performance comparison of these three algorithms together with the std::search, in different pattern and alphabet sizes.
B. Design issue
The fact that the KMP implementation of this library requires random-access iterators for both the pattern and the corpus, seems like a major design drawback to me. I also believe that this drawback virtually cripples the usefulness of the algorithm, for reasons I have already mentioned in previous posts.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk