
Subject: Re: [boost] [SORT] Parallel Algorithms
From: Steven Ross (spreadsort_at_[hidden])
Date: 2014-12-22 22:34:19


Francisco,

On Mon Dec 22 2014 at 5:04:17 PM Francisco José Tapia <fjtapia_at_[hidden]>
wrote:

> > I have been working on the memory used by the different algorithms,
> > especially the stable ones. I developed a low-memory algorithm,
> > circular_stable_sort, which uses only about 10% more memory; its speed
> > decreases around 15% with small objects, and with big objects it is
> > similar to merge_sort.
>
Some people may find that useful. What is the worst-case computational
complexity? Is it a well-known algorithm? The bar is going to be higher
(it may require a full review) for previously unknown algorithms.

> > I was also looking at the TimSort algorithm. I found a C++
> > implementation at https://github.com/gfx/cpp-TimSort .
> >
> > I inserted this algorithm into the benchmark programs. With small
> > objects its speed is slower than stable_sort and merge_sort, and it
> > performs well with big objects, larger than 64 bits. The decision of
> > Java and Python is logical, because they must sort objects that are
> > usually non-contiguous and bigger than the naked number.
> >
> > To measure the memory used by the program I use the command
> > /usr/bin/time -v program_to_execute. I take benchmark_algorithm_04
> > (100000000 uint64_t numbers), comment out all the invocations of the
> > algorithms except one, compile, and run it in a terminal with the
> > command /usr/bin/time -v benchmark_sort_04.
> >
> > I repeat this process with all the algorithms in order to know the
> > memory used by each one.
>

I suggest changing the struct you use for benchmarking to this:
template <uint32_t NN>
struct reburcio
{
  uint64_t M[NN];

  reburcio () {
    init();
  }

  // fill the whole array with random values
  void init() {
    for (uint32_t i = 0; i < NN; ++i) {
      M[i] = my_rand();
    }
  }

  // lexicographic comparison over the whole array
  bool operator < (const reburcio &R) const {
    for (uint32_t i = 0; i < NN; ++i) {
      if (M[i] != R.M[i]) {
        return M[i] < R.M[i];
      }
    }
    return false;
  }
};

That way it uses the default copy and assignment operators, which copy
the entire data structure, and it initializes and compares all elements
of the array. It appears that the benchmark you sent me only copies and
compares the first element of the array, which might have caused some of
the large-element differences in performance you were seeing.

With this data structure, countertree intro_sort is competitive with
std::sort on my Linux system on fully random data, and countertree
parallel_sort is competitive at 8, 16, and 256 bytes per element, but
significantly underperforms TBB (by 10-15%) at 32 and 64 bytes.

> > As you can see, with 256-byte elements the indirect
> > parallel_stable_sort is 3 times faster, and the indirect_sort is 25%
> > faster.
>

Yes, indirect sort should eventually be faster, given a large enough
element, especially if indirect sort is done with the fastest underlying
algorithm available.

> >
> > The final solution, if one is adopted in the end, will be a hybrid
> > algorithm depending on the size of the elements.
>

That sounds reasonable for fixed-size elements. What about variable-length
elements, like strings? Will this only be for fixed-size elements?

> > I propose the following: give me 2 or 3 weeks and I will prepare a big
> > benchmark with all the algorithms, including the indirect versions of
> > the stable and unstable sorts. We will run the benchmarks on different
> > machines and, according to the results, decide whether we propose
> > something to Boost, and which algorithm must be used in the hybrid
> > version.
> >
> > If, in the end, we decide not to propose anything, I can only say this
> > has been a nice and exciting experience.
>

I think indirect sorting is promising.

> >
> > My intention is not to impose my code; I want to provide the final
> > users with the best solution, if it exists, whether with my code or
> > with someone else's. I will be the first to applaud the winner.
> >
> > We are doing this because this parallel implementation doesn't need
> > any other library, only the standard one. There are compilers without
> > OpenMP or TBB support, and some users simply don't want to load those
> > libraries just to do a sort.
>
Agreed, but OpenMP is very common (even Android supports it!), so systems
without OpenMP are a very small niche in terms of total systems out there.
TBB is supported on Intel-compatible processors, and the vast majority of
deployed processors where parallel sorting makes sense (i.e., won't drain
the battery really quickly) are Intel-compatible. If we're going to provide
an alternative, it should cost very little in terms of performance, and
ideally provide some kind of benefit besides eliminating a dependency.

> > I have in mind several improvements to the algorithms. I will code and
> > test them, and if they run well, I will include them in the
> > algorithms. I also want to simplify the interface of indirect_sort, so
> > it can be used easily by any sort algorithm. If you are interested, I
> > will also prepare a brief document explaining the internal algorithm.
>
I'd like to see that document.

Along with testing a struct that is an array of ints, you should probably
test variable-length string sorting, as string sorting is very common.

Merry Christmas,

Steve


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk