Hi people,

After giving it more thought, maybe this is worth considering:

- Injectable heavy type: if statsort's best argument is avoiding comparisons through interpolation, its dream target applications could be distributions of data types that are genuinely costly to compare, for example boost::multiprecision::cpp_dec_float<N>. But this would require statsort not to gate its data type on std::is_arithmetic. A trait-based approach (e.g., sortable_numeric<T> or something) would let users opt in while maintaining type safety.

- Injectable interpolation function: users in physics/finance/ML often know their data distribution or can estimate it, so why not allow users to supply a CDF estimate that matches their actual data distribution? For a Gaussian mixture, you would inject the mixture CDF. This would extend statsort from "works on smooth distributions" to "works on any distribution the user can model", and (hopefully) make it competitive on multimodal distributions. Keeping linear as the default makes sense.

- Injectable bucket policy: the sqrt(n) number of buckets is hardcoded. But injectable interpolation would open the door to Gaussian mixtures, lognormal mixtures and other naturally arising distributions, where a "natural" number of buckets would be the number of modes in the distribution (often known in advance). I am not sure this would make anything faster, though, even if more intuitive, so it would require some benchmarking.

With these extensions, statsort could shine compared to boost::sort on benchmarks of billions of non-trivially (non-uniformly) generated heavy data types. And I expect those patterns (mixtures etc.) to be sufficiently common in physics and elsewhere to warrant inclusion in Boost. But 1) I'm no Boost.Sort expert, and 2) all of this is conditional on the extensions + benchmarks.

Or maybe I completely misunderstood ;)

Best wishes,
Arno

On Sun, Mar 15, 2026 at 5:12 PM Arnaud Becheler <arnaud.becheler@gmail.com> wrote:
Hi Francisco,
What is the generative process for the data submitted to the benchmark?
Because if I understood correctly, statsort is supposed to outperform only on quite specific distribution patterns. If you benchmark on other distributions, I would actually take that as a stress test for statsort, and a minimal difference in performance would be a good thing.
The next question would be "what data distribution could maximize statsort's performance while minimizing that of other algorithms?", and then benchmark on that distribution. As suggested before, a mixture of k Gaussian distributions could be a fair judge (but that would require investigating k, mu_k and sigma_k).
I haven't seen that in the presented benchmarks, but maybe I misunderstood :) Apologies if I did,
Best wishes,
Arnaud
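P.S. To make the first two suggestions at the top concrete, here is a rough sketch. Everything in it (the sortable_numeric trait, the cdf_bucket_sort name, the bucket pass itself) is my own invention for illustration, not statsort's actual interface:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical opt-in trait: instead of gating on std::is_arithmetic,
// users would specialize this for heavy types such as
// boost::multiprecision::cpp_dec_float<N>.
template <typename T>
struct sortable_numeric : std::is_arithmetic<T> {};

// Sketch of an interpolation pass with an injectable CDF: `cdf` maps a
// value into [0, 1) and picks its bucket, so a user whose data follows
// e.g. a Gaussian mixture can inject the mixture CDF. A final
// comparison sort per bucket cleans up whatever the model got wrong.
template <typename T, typename Cdf>
void cdf_bucket_sort(std::vector<T>& data, Cdf cdf, std::size_t n_buckets) {
    static_assert(sortable_numeric<T>::value,
                  "opt in via a sortable_numeric<T> specialization");
    std::vector<std::vector<T>> buckets(n_buckets);
    for (const T& x : data) {
        // Clamp so that u * n_buckets stays strictly below n_buckets.
        double u = std::clamp(cdf(x), 0.0, 1.0 - 1e-12);
        buckets[static_cast<std::size_t>(u * n_buckets)].push_back(x);
    }
    data.clear();
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end());
        data.insert(data.end(), b.begin(), b.end());
    }
}
```

With the linear default you would pass cdf(x) = (x - min) / (max - min); for a two-component mixture you would pass the weighted sum of the component normal CDFs.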
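P.P.S. For the benchmark-distribution question, a mixture of k Gaussians can be sampled by first picking a component, then drawing from it. A minimal sketch with <random>; the equal weights and the particular mu_k/sigma_k are arbitrary placeholders, not a recommendation:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draw n samples from a k-component Gaussian mixture with equal weights.
// mu[i] and sigma[i] hold the mean and standard deviation of component i.
std::vector<double> sample_gaussian_mixture(std::size_t n,
                                            const std::vector<double>& mu,
                                            const std::vector<double>& sigma,
                                            unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pick(0, mu.size() - 1);
    std::vector<double> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t k = pick(gen);  // choose a component uniformly at random
        std::normal_distribution<double> component(mu[k], sigma[k]);
        out.push_back(component(gen));
    }
    return out;
}
```

Weighted components would just swap the uniform pick for std::discrete_distribution.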