Boost logo

Boost :

Subject: Re: [boost] Integer sorting
From: Fenil Mehta (fenilgmehta_at_[hidden])
Date: 2019-01-22 09:09:13


I communicate with this E-mail account only: fenilgmehta_at_[hidden]
I have no other E-mail account.

Please forgive me for my mistyped time complexity, I have corrected it to
O(nk). I am not quite good at calculating the time complexity. Hence there
may be an error in my time complexity but the sorting idea is good.

I have been working on my project for the past 6 months and it was for the
first time I heard about ska_sort. When I read its implementation, it had
already implemented what I was planning to do in the near future.
I could not find ska_sort in the Boost library which I had recently
installed. If someone could throw some light on this, then it would be
greatly appreciated.
I completely agree that ska_sort is highly optimized. For all arrays of
primary data types(like int, float, char) or small objects(like
pair<int,int>) ska_sort performance is the best. However, for arrays of
large objects, ir_sort is better.

The test cases are generated completely randomly. I do not know the test
cases at all when I make the comparison. The source code for test case
generation is there on the GitHub[1], you can scrutinize it and a snippet
is as follows:
    // here I pass a vector for arr
    template<typename T>
    void fillRandArray(T &arr, const int64_t &low, const int64_t &high,
const int64_t &randStart, const int64_t &randEnd) {
        std::random_device rd;
        std::mt19937_64 gen(rd());
        std::uniform_int_distribution<ArrayIndexType> dis(randStart,
randEnd);

        for (int64_t i = low; i <= high; i++) arr[i] = dis(gen);
    }

As of my claims, let me give the clear details of the input.
ir_sort is faster when I tested it for an array of objects of size =
1000*64 bits, and array length = 2000 only.
I have to test it for other conditions.

Steven, at present I have only implemented the swapping optimization for
integers only. ir_sort is in the beginning phase. Once it finalized for
integers, I will plan a solution for other data types.
I have not looked much at the worst case distribution. At present, I am
testing it for random distributions. I will have a look at the worst
scenario and mostly sorted data.
For memory overhead, it will require two integer arrays of size "n", and
probably it can be reduced to a single array of size "n".
I agree that these algorithms have been designed keeping generality as the
focus. However, generally, objects are indexed based on integers, keeping
this in mind I decided to write ir_sort.

The reason it is way faster for large object arrays is that I only make "n"
swaps to sort the array. I use indexes to reduce the overhead of swapping.
I believe the swapping optimization I have written is not limited to
ir_sort, it can be used along with other sorting algorithms to improve
their running speed for large objects.

Francisco, thank you for your quick response. I wait for your feedback.

[1]
https://github.com/fenilgmehta/Fastest-Integer-Sort/blob/b11ae9ddb3c9cc7e9d141817bb31409bf51e59c7/test/raw_data_generator_integer.cpp#L155
[2] https://github.com/fenilgmehta/Fastest-Integer-Sort/

Regards,
Fenil Mehta

On Mon, Jan 21, 2019 at 8:18 PM Steven Ross via Boost <boost_at_[hidden]>
wrote:

> Fenil,
>
> Based on the description, this looks like spreadsort without the worst case
> analysis, and with a new swapping optimization (I know there is room for
> improvement in the swapping). I expect this algorithm will perform
> significantly worse than std::sort (or pdqsort) in the worst-case
> situations, without applying similar worst-case analysis. spreadsort is
> comparable to std::sort in the worst case distribution for spreadsort
> because of careful analysis and threshold optimization. If you tweaked the
> spreadsort constants
> <
> https://github.com/boostorg/sort/blob/master/include/boost/sort/spreadsort/detail/constants.hpp
> >
> I
> think you should be able to get comparable performance to your algorithm,
> minus the impact of your swapping optimization.
> How does your swapping optimization perform on mostly-sorted data?
> spreadsort does well in this common case relative to std::sort, which is
> somewhat tricky with in-place radix sorting.
> You should try seeing how your algorithm performs on the distributions in
> https://github.com/boostorg/sort/blob/master/example/alrbreaker.cpp and
> https://github.com/boostorg/sort/blob/master/example/binaryalrbreaker.cpp,
> tweaking those to hit your worst case.
> What's the memory overhead?
> What's the threshold size at which it starts becoming faster than std::sort
> and pdqsort? comparison-based sorting is surprisingly fast on small lists
> of integers that fit on L1 cache.
> How well does it handle common prefixes in string sorting? string_sort is
> optimized to handle that case extremely quickly. It's a serious
> performance issue if you don't optimize for it.
>
> I wish more people realized the generality of these algorithms; if you
> really want a fast sort, you can use spreadsort (or an equivalent
> algorithm), you just need to define a way to index into the key. This was
> published a while ago in Engineering Radix Sort, but most people use
> comparison sorts because they're easier to use (and for most applications,
> sorting compute time is minor).
>
> On Mon, Jan 21, 2019, 3:29 AM Francisco José Tapia via Boost <
> boost_at_[hidden]> wrote:
>
> > Hi Fenil,
> >
> >
> > I am Francisco Tapia, one of the maintainers of the Sort Library.
> >
> > I had been watching your code, and find it interesting. But give me a
> week
> > to say you something. These days I am very busy, and I expect to have
> time
> > the next weekend, and examine and test your code.
> >
> > Yours
> >
> >
> > Francisco Tapia
> >
> > El lun., 21 ene. 2019 a las 8:41, Richard Hodges via Boost (<
> > boost_at_[hidden]>) escribió:
> >
> > > On Sat, 29 Dec 2018 at 03:36, Fenil Mehta via Boost <
> > boost_at_[hidden]
> > > >
> > > wrote:
> > >
> > > > I have written a sorting algorithm which is way faster than std::sort
> > for
> > > > all array size.
> > >
> > >
> > > "Extraordinary claims require extraordinary evidence" - Carl Sagan
> > >
> > > https://en.wikipedia.org/wiki/Sagan_standard
> > >
> > >
> > >
> > >
> > > > The link is as follows:
> > > > https://github.com/fenilgmehta/Fastest-Integer-Sort
> > > >
> > > > Some guidance would be appreciated on how to improve the project
> > > structure
> > > > and the steps to get it included in boost.
> > > >
> > > > Regards,
> > > > Fenil Mehta
> > > >
> > > > _______________________________________________
> > > > Unsubscribe & other changes:
> > > > http://lists.boost.org/mailman/listinfo.cgi/boost
> > > >
> > >
> > >
> > > --
> > > Richard Hodges
> > > hodges.r_at_[hidden]
> > > office: +442032898513
> > > home: +376841522
> > > mobile: +376380212 (this will be *expensive* outside Andorra!)
> > > skype: madmongo
> > > facebook: hodges.r
> > >
> > > _______________________________________________
> > > Unsubscribe & other changes:
> > > http://lists.boost.org/mailman/listinfo.cgi/boost
> > >
> >
> > _______________________________________________
> > Unsubscribe & other changes:
> > http://lists.boost.org/mailman/listinfo.cgi/boost
> >
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk