Subject: Re: [boost] Going forward with Boost.SIMD
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2013-04-19 01:55:16


On Friday 19 April 2013 01:21:58 Marc Glisse wrote:
> On Thu, 18 Apr 2013, Andrey Semashev wrote:
> > 3. It supports division and modulus for integers?
>
> Why not?
>
> > Is it supported by any hardware?
>
> At least some special cases are, like division by a power of 2.

I think these special cases are better coded explicitly.
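Here is a minimal sketch of what coding a power-of-2 division explicitly could look like, assuming SSE2 and 32-bit signed lanes. It is not a bare shift. An arithmetic shift (psrad) rounds toward negative infinity, while scalar `/` truncates toward zero, so negative lanes need a bias first. The name `div_pow2_epi32` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical helper: divides each signed 32-bit lane of x by 2^k with the
// same rounding as scalar '/' (toward zero). A bare arithmetic shift rounds
// toward negative infinity, so negative lanes are first biased by 2^k - 1.
inline __m128i div_pow2_epi32(__m128i x, int k)
{
    assert(k >= 1 && k <= 31);
    const __m128i sign = _mm_srai_epi32(x, 31);                           // all ones in negative lanes, else 0
    const __m128i bias = _mm_srl_epi32(sign, _mm_cvtsi32_si128(32 - k));  // 2^k - 1 in negative lanes, else 0
    return _mm_sra_epi32(_mm_add_epi32(x, bias), _mm_cvtsi32_si128(k));
}
```

For unsigned lanes the bias is unnecessary and a single `_mm_srli_epi32` suffices.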

> And if the
> divisor is constant, you can also let the implementation handle turning it
> into a multiplication.

Does the compiler do that for user-defined operators (and the operators on
packs are user-defined)? Or do you mean that the implementation of the
operator will handle it? The latter means that division will be very slow,
but that's acceptable, since division is slow even in hardware.
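If the operator implementation handles a constant divisor itself, it can use the multiply-by-reciprocal technique that compilers apply to scalar division by constants. Below is a sketch for one hard-coded case, unsigned 16-bit lanes divided by 3, assuming SSE2. A general implementation would compute the magic constant and shift per divisor. The function name is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical helper: unsigned 16-bit division by the constant 3 as a
// multiply-high plus a shift. x / 3 == (x * 0xAAAB) >> 17 for every 16-bit x,
// because 0xAAAB == (2^17 + 1) / 3 and the rounding error stays below 1/6.
inline __m128i div3_epu16(__m128i x)
{
    const __m128i magic = _mm_set1_epi16(static_cast<short>(0xAAAB));
    const __m128i hi = _mm_mulhi_epu16(x, magic);  // (x * 0xAAAB) >> 16
    return _mm_srli_epi16(hi, 1);                  // one more bit: >> 17 in total
}
```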

> > 4. How would advanced operations be implemented, such as FMA and integer
> > madd? Is it through additional library provided functions? IMHO, the
> > availability of these operations is often crucial for performance of the
> > user's algorithm, if it is more complicated than just accumulating
> > integers.
> If you only want fma as a fast way to compute a+b*c, you could just let
> your compiler optimize an addition and a multiplication to fma. They are
> not bad at that. If you rely on the extra accuracy of fma, then library
> functions seem necessary.

In my experience, compilers are reluctant to pattern-match intrinsics and
replace them with other intrinsics (which is a good thing). So if the user's
code a*b+c*d is equivalent to two _mm_mullo_epi16/_mm_mulhi_epi16 pairs and an
_mm_add_epi32, then that's what you'll get in the output instead of a single
_mm_madd_epi16. Note also that _mm_madd_epi16 requires a special layout of its
operands in the xmm register elements (it multiplies adjacent pairs of 16-bit
elements and sums each pair into a 32-bit result), which is another blocker
for this compiler optimization.
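The following sketch, assuming SSE2 and covering only the low four lanes, shows both forms side by side. The first is a literal translation of a*b+c*d with widening. The second is the single pmaddwd, which only works after a/c and b/d have been interleaved. Both function names are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// a*b + c*d for the low four 16-bit lanes, widened to 32 bits, written the
// "obvious" way: full 32-bit products via mullo/mulhi, then a 32-bit add.
inline __m128i muladd_naive_lo(__m128i a, __m128i b, __m128i c, __m128i d)
{
    const __m128i ab = _mm_unpacklo_epi16(_mm_mullo_epi16(a, b), _mm_mulhi_epi16(a, b));
    const __m128i cd = _mm_unpacklo_epi16(_mm_mullo_epi16(c, d), _mm_mulhi_epi16(c, d));
    return _mm_add_epi32(ab, cd);
}

// The same result with one _mm_madd_epi16. It multiplies adjacent 16-bit
// pairs and sums each pair, so the operands must be interleaved first.
inline __m128i muladd_madd_lo(__m128i a, __m128i b, __m128i c, __m128i d)
{
    return _mm_madd_epi16(_mm_unpacklo_epi16(a, c), _mm_unpacklo_epi16(b, d));
}
```

In real code the interleaved layout would usually be prepared once in the data rather than shuffled before every madd.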

Regarding FMA, this is probably easier for compilers, but because of the
difference in accuracy I don't expect compilers to perform this optimization
lightly (i.e. without a specific compiler switch explicitly allowing it). And
such a switch, being a global option, may not be appropriate everywhere in the
application. So having a way to explicitly express the programmer's intention
is useful here too.
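The accuracy difference can be shown with scalar doubles and std::fma. The same single-versus-double rounding applies lane-wise to packed FMA instructions such as _mm_fmadd_pd (FMA3). This is a minimal sketch with hypothetical function names. The `volatile` keeps the compiler from contracting the separate multiply and add into an FMA on its own.

```cpp
#include <cassert>
#include <cmath>

// Separate multiply and add: a*b is rounded to double before c is added.
// Storing the product in a volatile prevents automatic FMA contraction
// (and forces rounding to double even on x87 extended-precision targets).
inline double mul_then_add(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: a*b + c computed with a single final rounding.
inline double fused(double a, double b, double c)
{
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-30, b = 1 - 2^-30 and c = -1, the exact product 1 - 2^-60 rounds to 1.0, so the separate version yields 0 while the fused one yields -2^-60.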

I think special operations like FMA, madd, hadd/hsub, avg and min/max should
be provided as functions. It might also be helpful to be able to convert packs
to and from compiler-specific types like __m128i, so that other, more
specialized intrinsics that are not available as functions can be used, or to
interoperate with inline assembler.
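Here is a rough sketch of that shape, assuming SSE2. It is a hypothetical `pack_i16x8` wrapper, not Boost.SIMD's actual pack type or API. Special operations are free functions, and explicit conversions to and from __m128i provide an escape hatch to any intrinsic the wrapper does not cover.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical pack of eight int16_t lanes; not Boost.SIMD's actual type.
class pack_i16x8
{
public:
    explicit pack_i16x8(__m128i v) : v_(v) {}                      // from the native type
    explicit pack_i16x8(std::int16_t x) : v_(_mm_set1_epi16(x)) {}  // broadcast a scalar

    __m128i native() const { return v_; }                           // back to the native type

    // Read one lane (via memory; for illustration and testing only).
    std::int16_t operator[](int i) const
    {
        assert(i >= 0 && i < 8);
        std::int16_t lanes[8];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v_);
        return lanes[i];
    }

private:
    __m128i v_;
};

// Special operations exposed as named functions rather than operators.
inline pack_i16x8 min(pack_i16x8 a, pack_i16x8 b)
{
    return pack_i16x8(_mm_min_epi16(a.native(), b.native()));
}

inline pack_i16x8 max(pack_i16x8 a, pack_i16x8 b)
{
    return pack_i16x8(_mm_max_epi16(a.native(), b.native()));
}
```

For example, `pack_i16x8(_mm_srai_epi16(p.native(), 1))` applies an intrinsic the wrapper does not expose and wraps the result back into a pack.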

What I also forgot to ask is how the paper and Boost.SIMD handle overflowing
and saturating integer arithmetic. I assume the operators on packs implement
overflowing (wrapping) operations, since that's how scalar operations work. Is
it possible to do saturating operations then?
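At the instruction level SSE2 already offers both flavours for 8- and 16-bit lanes, e.g. _mm_add_epi16 (wrapping) and _mm_adds_epi16 (saturating). One option, assumed here, is for a library to keep the operators wrapping and expose saturation as separately named functions. The sketch below shows the behavioural difference. The function names are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Wrapping addition: each lane overflows modulo 2^16 (two's complement),
// the behaviour assumed for the pack operators.
inline __m128i add_wrap_epi16(__m128i a, __m128i b) { return _mm_add_epi16(a, b); }

// Saturating addition: each lane is clamped to [-32768, 32767].
inline __m128i add_sat_epi16(__m128i a, __m128i b) { return _mm_adds_epi16(a, b); }

// Read one lane (via memory; for illustration only).
inline std::int16_t lane(__m128i v, int i)
{
    assert(i >= 0 && i < 8);
    std::int16_t lanes[8];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v);
    return lanes[i];
}
```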

