Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: David A. Greene (greened_at_[hidden])
Date: 2011-06-11 11:30:58


Mathias Gaunard <mathias.gaunard_at_[hidden]> writes:

> [ Note: I had written this before I saw Joel's answer, so there might
> be a bit of duplication here. Sorry about this. ]
>
> On 10/06/2011 22:16, David A. Greene wrote:
>
> Our argument is that vectorization needs to be explicitly programmed
> by a human, because otherwise the compiler may not be able to entirely
> rethink your algorithm to make it vectorizable.

The human may have to restructure some of the code, but a good compiler
can do a lot of restructuring and it can certainly do the mechanics of
generating vector code. In many cases all that's needed is a directive
here or there to help the compiler understand that there isn't a
dependency when it can't know that statically.

> Designing for vectorization can require changing your algorithm
> completely and you might get completely different results due to
> running with lower precision, running operations in different order
> (remember floating-point arithmetic is not associative) etc., making
> it a destructive conversion.

Yes, the programmer needs to keep all of these things in mind no matter
how vector code is generated.
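
To make the reordering point concrete, here is a minimal sketch in plain C++ (not Boost.SIMD code). The function names sum_sequential and sum_two_lanes are hypothetical, and the second function only imitates the order in which a 2-lane vector reduction would add elements. It assumes IEEE 754 single-precision float and no -ffast-math style flags.

```cpp
#include <cstddef>

// Left-to-right sum: the order a plain scalar loop uses.
float sum_sequential(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

// Two independent partial sums over even and odd indices, combined at
// the end. This mimics the addition order of a 2-lane vector reduction
// (an illustration only; no SIMD instructions are requested here).
float sum_two_lanes(const float* data, std::size_t n) {
    float lane0 = 0.0f;
    float lane1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += data[i];
        lane1 += data[i + 1];
    }
    if (i < n) {
        lane0 += data[i];  // leftover element when n is odd
    }
    return lane0 + lane1;
}
```

With the input {1e8f, 1.0f, -1e8f, 1.0f}, the sequential order loses the first 1.0f when it is added to 1e8f and returns 1.0f, while the two-lane order returns 2.0f. Same data, same operations, different answer, whether a human or a compiler chose the order.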

> If the programmer is not aware of how to program a SIMD unit, he might
> write code that is not vectorizable without realizing it.

If that's the case, then the programmer is just as likely to write
explicit vector code that is wrong. In other words, he will vectorize
when it is illegal to do so.

> Also, the compiler seems to be unable to do this automatic
> vectorization except in the most trivial of cases.

Examples?

> The benchmarks show this. Just in the slides, there is a 4x
> improvement when using std::accumulate (which is still a pretty
> trivial algorithm) with pack<float> instead of float, and in both
> cases automatic vectorization was enabled.

With which compilers? Again, I can see utility for boost.simd if the
compiler doesn't know how to vectorize well. I'm arguing that you can't
make the case that boost.simd is always necessary or even necessary in
most cases.

> Here is a non-exhaustive list of concerns to keep in mind when
> writing high-performance code on a single CPU core with a SIMD unit:
> - SIMD instructions used

Should be the job of vector codegen in the compiler, if the compiler
supports it.

> - memory allocation, alignment, and padding

A few directives here and there solve this.

> - loop unrolling for pipelining effects

The compiler should handle this, perhaps with a few directives to help.

> - cache-friendliness of memory accesses (tied to alignment as well --
> also important for when you go multicore)

Ditto.

> The choice of Boost.SIMD is to separate all of those concerns.
> pack<T> takes care of formalizing the register and generating the best
> set of instructions available on the target CPU for what you ask.

But how does the programmer know what is best? It changes from
implementation to implementation.

>> Manycore is the future and parallel processing is the new
>> normal.
>
> I don't see the direct link.

The direct link is that compilers are going to have to get good at this.
Some already are.

> Compilers can more easily tell that they can automatically parallelize
> a loop when it uses restrict pointers, but if you're explicitly
> parallelizing that loop (which is our approach -- explicit description
> of the parallelism but automatic generation of the associated code),
> then it's not strictly required.

And you're going to end up with code that is not performance portable.
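
For reference, the restrict hint mentioned in the quoted paragraph can be sketched as below. This is an illustration under assumptions: __restrict is a compiler extension accepted by GCC, Clang and MSVC rather than standard C++, the function name scaled_add is hypothetical, and whether the loop is actually vectorized depends on the compiler and its flags.

```cpp
#include <cstddef>

// The __restrict qualifiers promise the compiler that out, a and b do
// not overlap, so it need not assume a store to out[i] can change a
// later a[j] or b[j]. Breaking that promise is undefined behaviour.
void scaled_add(float* __restrict out,
                const float* __restrict a,
                const float* __restrict b,
                float scale,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + scale * b[i];
    }
}
```

The loop body stays ordinary scalar code; the qualifier only removes an aliasing assumption and leaves the choice of instructions to the compiler.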

>> pack<> does address alignment, but it's overkill.
>
> It doesn't address it, it requires it.

That's overkill. Alignment often isn't required to vectorize.

>> It's also
>> pessimistic. One does not always need aligned data to vectorize, so the
>> conditions placed on pack<> are too restrictive.
>
> Loads and stores from an arbitrary pointer can be much slower, are not
> portable, and are just a bad idea. This is not how you're meant to use
> a SIMD unit, or even an ALU or a FPU.

With AMD and Intel's latest offerings alignment is much less of a
performance issue.

> pack generates the right instructions through use of intrinsics, and I
> don't see what you mean about suboptimal code.

Because the code the programmer writes may not be the best code for
the given microarchitecture. This stuff is often non-obvious.

>> I think a far more useful design of this library would be providing
>> standard ways to assert certain conditions. For example:
>>
>> simd::assert(simd::is_aligned(&v[0], 16))
>
> We already have this, though it's is_aligned<16>(&v[0])
> is_aligned(&v[0]) also works, it checks against the strongest
> alignment required for the largest SIMD unit (i.e. 32 on
> SandyBridge/Bulldozer, and possibly 64 on future Intel processors).

32 is not always the right answer on SandyBridge/Bulldozer because
256 bit vectorization is not always the right answer.
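
A check of this kind can be sketched in standard C++ as follows. The name is_aligned_to is hypothetical and this is not the Boost.SIMD implementation; the alignment is an explicit template parameter so the caller picks 16 or 32 rather than having one chosen for him.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if p is a multiple of Alignment bytes.
// Alignment must be a nonzero power of two.
template <std::size_t Alignment>
bool is_aligned_to(const void* p) {
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");
    return (reinterpret_cast<std::uintptr_t>(p) & (Alignment - 1)) == 0;
}
```

For a buffer declared as alignas(32) float v[16], &v[0] passes both the 16- and 32-byte checks, while &v[4] (16 bytes in) passes only the 16-byte one.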

What's the vector length of a pack<int> on Bulldozer?

> Joel and I work in HPC research, and we need that performance. We
> can't just hope that compilers will automagically do it for us.

I also work in HPC. Everyone I know relies on compilers.

> You might as well write a naive matrix multiplication and expect the
> compiler to generate you code with performance on par with that of an
> optimized BLAS routine.

That is in fact what is happening via autotuning. Yes, in some cases
hand-tuned code outperforms the compiler. But in the vast majority of
cases, the compiler wins. If you want to use boost.simd for the cases
where it does not, that is entirely appropriate. But again, don't claim
that boost.simd is always the answer because compilers are dumb. They
are not.

>> I wonder how pack<T> can know the best vector length. That is highly,
>> highly code- and implementation-dependent.

> If you do pack<T>, it selects out of all of these the ones that
> support vectors of T, and takes the one with the biggest size.

That's often the wrong answer.

>> I don't mean to be too discouraging. But a library to do this kind of
>> stuff seems archaic to me.

> Since you don't seem to have particular real world experience with
> SIMD, you don't really make a case for your judgment.

My case comes from years of experience.

> Most if not all high-performance numerical computation libraries also
> write SSE instructions directly and don't rely on the compiler to do
> it. (BLAS, LAPACK, FFTW to name a few)

Yes, and boost.simd could be a good solution for those cases. But I
don't think it's a good solution for the programmer to write his
application.

> Also, if you want a vectorized sine function, you have to write
> it. Compilers don't have any, and if they did, they'd just use a
> library for it.

Yes. Again, boost.simd might help, but it does not relieve the burden
of writing multiple implementations of a library function for multiple
hardware implementations.

> So I can't really agree with the claim that the library is archaic
> on the basis of such a shallow analysis.

It may be useful in some instances, but it is not a general solution to
the problem of writing vector code. In the majority of cases, the
programmer will want the compiler to do it because it does a very good
job.

                             -Dave

