Hi people,

After giving it more thought, maybe this is worth considering:

- Injectable heavy type: if statsort's best argument is avoiding comparisons through interpolation, its dream target applications could be distributions of data types that are genuinely costly to compare, for example boost::multiprecision::cpp_dec_float<N>. But this would require statsort not to gate its data type on std::is_arithmetic. A trait-based approach (e.g., sortable_numeric<T> or something) would let users opt in while maintaining type safety.

- Injectable interpolation function: users in physics/finance/ML often know their data distribution or can estimate it, so why not allow users to supply a CDF estimate that matches their actual data distribution? For a Gaussian mixture, you would inject the mixture CDF. This would extend statsort from "works on smooth distributions" to "works on any distribution the user can model", and (hopefully) make it competitive on multimodal distributions. Keeping linear as the default makes sense.

- Injectable bucket policy: the sqrt(n) number of buckets is hardcoded. But injectable interpolation would open the door to Gaussian mixtures, lognormal mixtures and other naturally arising distributions, where a "natural" number of buckets would be the number of modes in the distribution (often known in advance). I am not sure this would make anything faster, though, even if more intuitive, so it would require some benchmarking.

With these extensions, statsort could shine compared to boost::sort on benchmarks of billions of non-trivially (non-uniformly) generated heavy data types. And I expect those patterns (mixtures etc.) to be sufficiently common in physics and elsewhere to warrant inclusion in Boost. But 1) I'm no Boost.Sort expert, and 2) all of this is conditional on the extensions + benchmarks.

Or maybe I completely misunderstood ;)

Best wishes,
Arno

On Sun, Mar 15, 2026 at 5:12 PM Arnaud Becheler <arnaud.becheler@gmail.com> wrote:
Hi Francisco,
What is the generative process for the data submitted to the benchmark?
Because if I understood correctly, statsort is supposed to outperform only on quite specific distribution patterns. If you benchmark on other distributions, I would actually take that as a stress test for statsort, and a minimal difference in performance would be a good thing.
The next question would be "what data distribution could maximize statsort's performance while minimizing that of other algorithms?", and then benchmark on that distribution. As suggested before, a mixture of k Gaussian distributions could be a fair judge (but that would require investigating k, mu_k and sigma_k).
I haven't seen that in the presented benchmarks, but maybe I misunderstood :) Apologies if I did,
Best wishes,
Arnaud
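P.S. To make the first two suggestions at the top concrete, here is a rough sketch. Everything in it (the sortable_numeric trait, the cdf_bucket_sort name, the bucket pass itself) is my own invention for illustration, not statsort's actual interface:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical opt-in trait: instead of gating on std::is_arithmetic,
// users would specialize this for heavy types such as
// boost::multiprecision::cpp_dec_float<N>.
template <typename T>
struct sortable_numeric : std::is_arithmetic<T> {};

// Sketch of an interpolation pass with an injectable CDF: `cdf` maps a
// value into [0, 1) and picks its bucket, so a user whose data follows
// e.g. a Gaussian mixture can inject the mixture CDF. A final
// comparison sort per bucket cleans up whatever the model got wrong.
template <typename T, typename Cdf>
void cdf_bucket_sort(std::vector<T>& data, Cdf cdf, std::size_t n_buckets) {
    static_assert(sortable_numeric<T>::value,
                  "opt in via a sortable_numeric<T> specialization");
    std::vector<std::vector<T>> buckets(n_buckets);
    for (const T& x : data) {
        // Clamp so that u * n_buckets stays strictly below n_buckets.
        double u = std::clamp(cdf(x), 0.0, 1.0 - 1e-12);
        buckets[static_cast<std::size_t>(u * n_buckets)].push_back(x);
    }
    data.clear();
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end());
        data.insert(data.end(), b.begin(), b.end());
    }
}
```

With the linear default you would pass cdf(x) = (x - min) / (max - min); for a two-component mixture you would pass the weighted sum of the component normal CDFs.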
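P.P.S. For the benchmark-distribution question, a mixture of k Gaussians can be sampled by first picking a component, then drawing from it. A minimal sketch with <random>; the equal weights and the particular mu_k/sigma_k are arbitrary placeholders, not a recommendation:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draw n samples from a k-component Gaussian mixture with equal weights.
// mu[i] and sigma[i] hold the mean and standard deviation of component i.
std::vector<double> sample_gaussian_mixture(std::size_t n,
                                            const std::vector<double>& mu,
                                            const std::vector<double>& sigma,
                                            unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pick(0, mu.size() - 1);
    std::vector<double> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t k = pick(gen);  // choose a component uniformly at random
        std::normal_distribution<double> component(mu[k], sigma[k]);
        out.push_back(component(gen));
    }
    return out;
}
```

Weighted components would just swap the uniform pick for std::discrete_distribution.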