Boost logo

Boost :

Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-08 21:55:41


Stefan wrote:
> Hello, I am Stefan and this is my first year at GSoC as it's the first
> time I'm eligible to participate as a student. Since I have been using
> C++ and the Boost libraries for quite some time, I would be thrilled to
> be able to code for Boost.
> I have quite a few ideas, and I was wondering which ones you consider to
> benefit the most, therefore I'm awaiting for your feedback. Also, please
> let me know if any of these ideas are already implemented in any of the
> libraries. Here we go:
>
> * Selection algorithm (find the kth smallest element in a container,
> O(n) worst-case, very easy to implement)
>
> * Optimize Boost String Algorithm finders to use an efficient substring
> matching algorithm as opposed to the naive algorithm. Most likely
> because of the fact that boost has implemented the finders for strings
> as generic algorithms, they haven't used efficient algorithms such as
> KMP, Boyer-Moore, Rabin-Karp etc.
> \see boost/algorithm/string/detail/finder.hpp
> The idea is to implement efficient algorithms such as the ones
> mentioned above to replace the current naive implementation. Note: it
> may be difficult to find nth matching substring position using some of
> these algorithms.
> Difficulty: easy-medium
>
> * Implement a string algorithm for generating the suffix array. The
> suffix array, after being precomputed on a certain string, allows for
> very efficient substring searches in the future. This is useful when the
> string doesn't change too often but there are plenty of substring to
> search against.
>
> * Yet another efficient associative data structure, skip lists?
>
> * Space partitioning data structures? kd-trees? quadtrees? octtrees?
> collision detectors? closest neighbors?
>
> * A parser capable of evaluating mathematical expressions which contains
> operands,operators,function calls,variable names and constant names
> (also considering priority of operators):
> 1. a list of valid operators described by: a) number of operands
> (unary/binary/ternary?); b) allowed operand types; c) the precedence
> (priority of the operator)
> 2. a list of existent functions described by: a) function name; b)
> number of parameters; c) type of each parameter and return type; d) a
> callback (maybe just automatically deduce type based on the callback's
> type if possible)
> 3. a list of constants
> 4. a list of allowed variable names and types
>
> Here's a pseudo-C++ code sample on how it should look like:
> namespace m = boost::math_expressions_parser;
> m::parser p;
> p.add_operators()
> ('+', 1, ret<int32_t>(_1+_2)) // binary +, precedence 1
> ('*', 2, ret<int32_t>(_1*_2)) // binary -, precedence 2
> ('+', 3, ret<int32_t>(_1)) // unary +, precedence 3
> ('-', 3, ret<int32_t>(-_1)); // unary - (also a funny face), precedence 3
> p.add_functions()
> ("sin", ret<int32_t>(sin(_1)));
> p.add_constants()
> ("PI", 3);
> p.add_allowed_variables()
> (allowed_var<int32_t>("x"));
>
> m::expression E("1 + 3*sin(PI*x)", p); // evaluates mathematical
> expression E against parser p
> if (!E.valid()) { /* ... */ }
> else {
> std::cout << E.evaluate( ("x", 3) ); // evaluate this expression by
> letting x=3
> E.let() ("x",4);
> std::cout << E.evaluate();
> }
> //Maybe an optimization so only the parts that depend on the changed
> variable will be recalculated every time?. E.g.: for expression f(x)+y,
> f(x) is recomputed only when x is changed. This may cause problems for
> functions that do not always yield the same results for the same input.
> Maybe make this optimization optional, only if the user marks the
> function/operator as "volatile" or something
> Difficulty: hard-very hard
>
> * Someone suggested on this ml earlier that a crypto suite would be
> useful for boost. Maybe, but it would take a lot of time and dedication
> to implement. I believe that we should at least add some Cryptographic
> hash functions (such as md5, sha1, sha256 etc.). No one wants to install
> Crypto++ or something similar just to be able to compute the SHA1 sum of
> a file. It is a common necessity to want to compute a hash function of a
> string.
> Difficulty: very easy if you copy the implementation and just create a
> boost-ish interface
>
> * In-place radix sort? Radix sort is a very efficient algorithm which
> performs better than std::sort (my implementation) (also asymptotically
> better) for some particular types such as: uint8_t, uint16_t, uint32_t,
> unsigned char, unsigned char[2], unsigned char[4] etc. Radix sorting
> takes linear time, but unfortunately, linear memory. It is very useful
> for sorting very large amounts of numbers tho (or genetic codes and
> maybe some other stuff).
>
> I am able to implement some of these stuff and I'm waiting for your
> opinion in what would be most useful to be part of boost. Let me know if
> you believe that none will have any use.
>
> Yours sincerely,
> Stefan
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>

Here are my ideas before submitting a proposal:
  * Selection algorithm (quick implementation, useful for efficiently
finding kth smallest number in a container offering random access iterators)
  * In the String algorithms library, add optimal algorithms for
searching a substring in a string: Rabin-Karp, Knuth-Morris-Pratt,
Boyer-Moore, possibly Boyer-Moore-Horspool and suffix arrays. As has
been previously discussed in the mailing lists, these algorithms can
offer significant improvements to the current substring searching
implementation in boost::string::find_*(). Some of these algorithms
perform better when the string changes less often, others perform better
  when substring changes less often and their efficiency is quite
dependent on the input data.
I believe it would be a good idea to perform benchmarks on these
algorithms using various types of input data: random strings and
substrings, strings and substrings taken from the English language,
substrings that occur often in the strings, substrings that occur only
partly in the strings, very long substrings, very short substrings,
searches with very small alphabets (such as DNA) etc.
It is also my belief that it would be a good idea to add such benchmarks
  to the documentation pages so that the users of these algorithms will
know which one will be the most suitable, having additional information
about the input.
  * Extend the bindings to Python of Boost.Graph or write the topology
generators (whichever you consider more important or of a higher
priority). Personally, I would prefer doing the bindings to python,
because Python is one of my favorite programming languages and I've
never done something similar before. This would be beneficial for me, as
I'm sure I can do it and I will also have the chance of learning
something new (i.e. improve my python and my boost.graph skills)

Can you please review my list of ideas and let me know what you think?

Regards,
   Stefan


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk