Boost logo

Boost :

Subject: Re: [boost] Fwd: [SORT] Parallel Algorithms
From: Steven Ross (spreadsort_at_[hidden])
Date: 2015-06-06 14:49:42


Hi Francisco,

> In my opinion, sort::parallel is better leave in the actual place. When
> finish the process with sort::parallel, I have intention to move my other
> library CounterTree to the Incubator.

Sounds good to me.

Thanks for changing your examples to run off of files! That's easier
to test and debug.
Please provide your examples with a name of what they do, not just 1, 2, 3.

I suggest setting up all your benchmarks to run within 5 minutes by
default, and have a flag to do the full-size test.

Feel free to rip out timsort if it causes you any trouble; I'd only
mentioned it because it's a popular sort. Please make sure the
license allows you to include it in Boost.

> I had seen several grammatical mistakes in the documentation, due to the
> fatigue, the excess of cut and paste and my lack of English grammar.

There is a mistake in the sort::parallel documentation:
"These algorithms are generic, can sort any kind of object using a
comparison object, in contrast with the specialized algorithms, as
spreadsort, which only sort integers, strings and float values in a
single thread version, but with a speed greater than the generic sort
algorithms."

spreadsort can sort any data for which you can define a strict weak
ordering (just like std::sort). See
https://github.com/boostorg/sort/blob/master/example/generalizedstruct.cpp
for a moderately complicated example. What's different about purely
comparison-based algorithms vs spreadsort is that only a comparison
object is required for purely-comparison-based algorithms, where
spreadsort requires indexing functions into the key (such as bitshift
or the equivalent of [] for strings). If you can write a comparison
function, you can write an equivalent indexing function, though that
doesn't mean it's easy. Many people have made the same assumption as
in your parallel_sort documentation, and I'm not sure what the best
way to clarify it is.

I think it'd be interesting to try a parallel spreadsort using your
parallelization approach, if your library gets substantial usage.

Do you need to have separate windows and linux cpp files? Much of the
benchmark code appears duplicated, and that becomes a maintenance
hassle over time. Please share as much of the code as you can.

I have some questions about your int_array object:
1) Why does it have these two functions:
int_array ( uint64_t K =0 );
int_array & operator = ( uint64_t K);

2) Why doesn't this print out the entire contents of the array:
std::ostream & operator << ( std::ostream & out , const int_array<NN> & N)

3) The object has a fairly expensive comparison function (there is no
early exit, it's like a hash comparison), it doesn't seem like a
problem, but I'm curious why you picked that comparison approach?

I look forward to reviewing it once you've completed cleaning it up
(including your docs). Did you include the tbb optimization to be
very fast with already-sorted data (I haven't run the library to
check)? I assume all your algorithms are either competitive in speed
or better in memory than their tbb and std library equivalents.

>
> I generate the doxygen documentation for all the code, but I think is very
> big and unnecessary for the final user of the library. I will change and
> put the documentation of the final functions.

They only need to see what is outside your detail namespace.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk