Boost logo

Boost :

Subject: Re: [boost] [review][sort] Formal review period for Sort library begins today, November 10, and ends Wednesday, November 19
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2014-11-15 09:13:21


On 14 Nov 2014 at 21:34, Steven Ross wrote:

> >> - What is your evaluation of the design?
> >
> > It's okay. I would prefer a design where there is a sorted_view which
> > if std::copy gives an actually sorted copy. Obviously one would then
> > specialise that scenario with a more directly optimised
> > specialisation, and you'd also have the API presented here which
> > looks like std::sort().
>
> I'm not sure exactly what you mean by a sorted_view.

some_iterator first, last;
sorted_view v(first, last); // O(N), builds an index
for(auto &i : v) // O(N log N), but can early out
 ....

This is the sort of stuff that separates a sort function from a Boost
general purpose sort library. sorted_view builds a lookup index in
O(N) time and space and can be looked at as a fixed initial cost.
Iterating the view may take considerably longer, but because the O(N
log N) complexity is pushed into the iteration, code can early out or
do other more efficient things. Point is you give options to code
with this design rather than sorting being all or nothing as it is
with std::sort. What one is looking for is *significantly* improving
on the STL sort facilities, and performance is one of many ways of
improving on the STL.

A particular advantage of the above design is that it should be much
more cache friendly for large datasets. You blitz the LL cache in a
single pre-iteration step when building the index, after that
iteration should only touch a small subset of total cache lines.

As a practical example of utility, coincidentally this Monday I
should be starting yet another new Boost library, this one
implementing a portable kernel memory allocator and manager. This
abstracts the management of memory allocated in the kernel. One thing
you can do with it is work with sparse really huge datasets, ones
which cannot possibly fit into a process address space. A sorted_view
design would let you build the sort index as a very slow initial
overhead (each window into the sparse data has to be individually
mapped, processed and unmapped) and then iteration could "pick off"
the top N items from the sort using very few window maps.

Another example of what I mean by sorted_view. Hopefully it makes
sense now.

> > I'd also prefer to see fallback implementations for less than random
> > iterators, right down to forward only iterators.
>
> To sort at reasonable speed requires either random access iterators or
> allocating a deep or shallow copy of the data. I could write a simple
> wrapper to create a vector of iterators pointing to the input data,
> sort the iterators, and then use them to permute the original list,
> but what's the use case?

You're assuming sizes of dataset which can fit into memory, and/or
all items are available quickly to the sorting process.

I am also unsure if random access iterators really are necessary. A
brute force solution is to create a copy of a forward only iterator
one per item, and swap those around. I'm sure there must be sort
algorithms that can do better than that though.

> > 2. Exception safety is not obvious, and explicit exception guarantees
> > need to be made and stated per API in the documentation.
>
> How about statements like this in the documentation for each of the headers:
> Exceptions: Throws if any of the element comparisons, the element
> swaps (or moves), the right shift, subtraction of right-shifted
> elements, or the operations on iterators throws.

Exception safety is about what "throws" exactly does at any point in
the sort. Does it leave things half sorted? If all your operators
above are marked noexcept, can you use an inplace sort instead of
making copies? Does your sort recover from a throw, or take
alternative action e.g. exclude the items where a throw happens. That
sort of thing.

Generally the STL follows a rule of leaving inputs untouched if an
exception is thrown which is why when operators are not noexcept it
must do a lot of copying. This is where std::move_if_noexcept comes
from. Probably code entering a Boost library should copy STL
conventions here where appropriate.

Now, none of the above I think is a requirement for entry into Boost
which is why I didn't make any of it a condition. Just food for
thought on how to make a really valuable general purpose sort
library. Your library, if renamed to Boost.SpreadSort, is not far
from entry as is.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk