Subject: Re: [boost] [sort] Timsort review reminder
From: Pete Bartlett (pete_at_[hidden])
Date: 2017-06-07 14:44:18
On 7 Jun 2017, at 13:31, Niall Douglas via Boost <boost_at_[hidden]> wrote:
>> Francisco and I are in the middle of rewriting the docs for the Sort
>> library. Integrating Timsort in there once done will be relatively simple.
> I really wish you'd supplied that single page already. New Boost code is
> supposed to come with documentation.
>> I'll be honest here: I personally would have felt this pull request
>> better reviewed and handled by Boost.Sort's maintainer. It's too small
>> and limited for a full fat review by the entire community unless there
>> is something very controversial about it and the maintainer feels the
>> community needs to invest a week into thinking about this.
>> I would prefer a mini-review myself (as I agreed to do when adding new
>> algorithms to the collection).but Ronald suggested I do a full review.
>> My main questions are: Does anyone care about Timsort?
> My sole experience with Timsort was many years ago when I was confounded
> by some Python code I was writing where a sort was not scaling with
> input as it should have done (it was considerably better than N log N,
> and I couldn't see how that was possible).
> So yes, Timsort should be in C++, preferably with std::sort() doing it
> automatically when the to be sorted set is big enough to make it
> worthwhile. Exactly where than cutoff is is very hard to estimate though
I agree that TimSort should be in Boost. But it isn't just a matter of size - TimSort excels when the input is already mostly-sorted at the possible cost in the case when it is really unsorted. Of course std::sort has to work for everything without being told ahead of time whether it is mostly sorted or not. So it would be great to be explicitly choose TimSort from Boost.Sort when I /do/ know that.
> And I haven't looked at the implementation, a big worry would be
> exception safety and guarantees
I took a brief look at the pull request. Apologies if I wasn't looking at the latest code but brief comments:
- Tests didn't appear to cover custom comparator cases at all.
- Corner case tests give misleading re-assurance. They test for the zero and one element cases, but these are irrelevant for TimSort - very small inputs automatically fall through to a simple sort. You need to test the real boundaries (around 32 elements?)
- Given users are only going to use this to squeeze performance, why are the internal vectors tmp_ and pending_ hard-coded to std::allocator<T> - the user may want to provide an alternate allocator or even a completely separate workspace - he may have an existing workspace that can be re-used - should this be a customisation point?
- Documentation needs to give some comparison against some popular std::sort implementations. Against which and for what data is TimSort more performant?