
On Tue, May 20, 2025 at 5:10 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
> That's a matter of opinion, I guess, but I'd rather have people not wanting the fallback write the compile-time check instead of the other way around. Sometimes you're not writing a final application but a library (say, on top of candidate Boost.Bloom), and you don't control compilation flags or target architecture.
I guess my concern is that people reading the documentation will assume that if fast_ compiles, it uses SIMD. But I see your point. To be clear about what I mean here: *"but uses faster SIMD-based algorithms when SSE2, AVX2 or Neon are available"*. A user might think: my CPU supports AVX2, so surely it will use the SIMD algorithms. But "available" here refers to the compiler options (and, obviously, CPU support when the binary is started), not just to CPU support. I know I am not telling you anything you do not know; I just think a large percentage of users might misunderstand what "available" means.
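To illustrate the distinction, here is a minimal sketch of the kind of compile-time check a user could write. It assumes the usual compiler-predefined macros (`__AVX2__`, `__SSE2__`, `__ARM_NEON`, plus the MSVC equivalents); `simd_target` and `REQUIRE_SIMD` are hypothetical names for illustration and are not part of candidate Boost.Bloom, whose own detection logic may differ.

```cpp
#include <string_view>

// Hypothetical helper (not part of candidate Boost.Bloom): reports which SIMD
// instruction set the *compiler* was asked to target for this translation
// unit. It says nothing about the CPU the binary will eventually run on.
constexpr std::string_view simd_target()
{
#if defined(__AVX2__)
    return "AVX2";   // e.g. -mavx2 or -march=native on GCC/Clang, /arch:AVX2 on MSVC
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";   // baseline on x86-64
#elif defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
    return "Neon";
#else
    return "none";   // a scalar fallback would be used
#endif
}

// Opt-in hard error for users who do not want a silent fallback.
// REQUIRE_SIMD is a hypothetical macro the user would define themselves.
#if defined(REQUIRE_SIMD)
static_assert(simd_target() != "none",
              "no SIMD instruction set enabled; check your compiler flags");
#endif
```

On x86-64 with default flags this typically reports SSE2 even on an AVX2-capable CPU, which is exactly the situation where "available" can be misread.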
> I fail to see any run-time table initialization in your original snippet at https://godbolt.org/z/sYfc7rffa .
I am not a SIMD expert, but is this not creating those variables on the stack? gcc asm:

    vbroadcastsd ymm1, qword ptr [rip + .LCPI0_1]
    vmovaps xmm3, xmm1
    vmovaps ymmword ptr [rsp + 64], ymm3
    vpmovsxbq ymm4, dword ptr [rip + .LCPI0_4]
    vmovaps ymmword ptr [rsp + 128], ymm4
    vmovaps ymmword ptr [rsp + 192], ymm1
    vmovaps ymmword ptr [rsp + 256], ymm1
    vmovaps ymmword ptr [rsp + 320], ymm1
    vmovaps ymmword ptr [rsp + 384], ymm1
    vmovaps ymmword ptr [rsp + 448], ymm1

But again, my question was mostly about how certain those optimizations are for Bloom, considering the huge variety of compilers and compiler options, not to mention future refactoring that might trip up the compiler optimizations. Now, I may be just too paranoid, but those variables are not simple ints, so I suspect that is why compilers have a problem computing them at compile time in my godbolt example, although, as you said, they do so successfully for Bloom, and I have verified that on my machine the compiler optimizes my example code.
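For what it is worth, one way to avoid depending on the optimizer at all is to keep the constant data in a `constexpr` array of plain integers, which the language requires to be initialized at compile time, and to load vectors from it at run time. The sketch below is only an illustration under that assumption: `make_masks`, `masks` and `mask_for` are hypothetical names with made-up table contents, and this is not a description of how Boost.Bloom is implemented. It also leaves the SIMD load out, since the intrinsics that build vector values are typically not `constexpr`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical table builder: entry i has bit (8 * i) set.
// Needs C++17 (constexpr std::array element assignment).
constexpr std::array<std::uint64_t, 8> make_masks()
{
    std::array<std::uint64_t, 8> t{};
    for (std::size_t i = 0; i < t.size(); ++i) {
        t[i] = std::uint64_t(1) << (8 * i);
    }
    return t;
}

// A constexpr variable must be constant-initialized, so the table is
// guaranteed to be computed by the compiler regardless of optimization level.
inline constexpr std::array<std::uint64_t, 8> masks = make_masks();

// Compilation fails if the table is not a compile-time constant or is wrong.
static_assert(masks[0] == 1, "unexpected first entry");
static_assert(masks[7] == (std::uint64_t(1) << 56), "unexpected last entry");

// Run-time lookup only reads the precomputed data.
inline std::uint64_t mask_for(std::size_t i)
{
    return masks[i % masks.size()];
}
```

The `static_assert`s turn "the compiler happens to fold this" into "the build breaks if it cannot", which is the kind of certainty I was asking about.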