Subject: Re: [boost] Going forward with Boost.SIMD
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2013-04-24 15:57:08


On 24/04/13 20:00, dag_at_[hidden] wrote:

> All of the scalar and complex arithmetic using simple binary operators
> can be easily vectorized if the compiler has knowledge about
> dependencies. That is why I suggest standardizing keywords, attributes
> and/or pragmas rather than a specific parallel model provided by a
> library. The former is more general and gives the compiler more freedom
> during code generation.

> But see that's exactly the problem. Look at the X1. It has multiple
> levels of parallelism. So does Intel MIC and GPUs. The compiler has to
> balance multiple parallel models simultaneously. When you hard-code
> vector loops you remove some of the compiler's freedom to transform
> loops and improve parallelism.

Automatic parallelization will never beat code optimized by experts.
Experts program each type of parallelism by taking into account its
specificities.
A one-size-fits-all model for all kinds of parallelism is nice, but
limited; using a dedicated tool for each type of parallelism is the
right approach for maximum performance.

While it could be argued that experts should use the lowest level API to
reach their goals, such libraries can still make experts much more
productive.

An interesting point in favor of a library is also memory layout. A C++
compiler cannot change the memory layout on its own to make it
friendlier to vectorization. By providing the right types and primitives,
the library makes the user aware of the issues at hand and empowers him
to state explicitly how a given algorithm is to be vectorized.
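
To make the layout point concrete, here is a minimal sketch in plain
C++. It is not Boost.SIMD code; the names PointAoS, PointsSoA, to_soa
and scale_x are invented for the illustration. It shows the kind of
layout change (array-of-structures to structure-of-arrays) that the
programmer has to make by hand, because the compiler may not reorder
the members of a user-defined type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: the x, y, z of one point are adjacent, so a loop
// over all x values walks memory with a stride of three floats.
struct PointAoS { float x, y, z; };

// Structure-of-arrays: each component is contiguous, so consecutive x
// values can be loaded with unit stride.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// The layout change is done explicitly by the programmer.
PointsSoA to_soa(const std::vector<PointAoS>& in) {
    PointsSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    for (const PointAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    assert(out.x.size() == in.size());
    return out;
}

// Unit-stride loop over contiguous data.
void scale_x(PointsSoA& pts, float k) {
    for (std::size_t i = 0; i < pts.x.size(); ++i)
        pts.x[i] *= k;
}
```

Whether a given compiler actually vectorizes scale_x depends on the
compiler and its flags; the sketch only shows the data arrangement.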

> For specialized operations like horizontal add, saturating arithmetic,
> etc. we will need intrinsics or functions that will be necessarily
> target-dependent.

The proposal suggests providing vectorized variants of all mathematical
functions in the C++ standard (the Boost.SIMD library covers C99, TR1
and more). That's quite a lot of functions.
Should all these functions be made compiler built-ins? That doesn't
sound like a very scalable and extensible approach.
You'll probably want to use different algorithms for the SIMD variants
of these functions, so having the compiler auto-vectorize the scalar
variant doesn't sound like a great idea either.

> Vector masks fundamentally change the model. They drastically affect
> control flow.

Some processors have had predication at the scalar level for quite some
time. It hasn't drastically changed the way people program.

It is similar to doing two instructions in one (any instruction can also
do a blend for free), and fusing two separately written instructions
into one is something that a compiler should be able to do pretty well.
It doesn't sound very unlike what a compiler must do for VLIW codegen to
me, but then I have little knowledge of compilers.
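
A rough sketch of the "operation plus blend" view, in portable C++ with
std::array standing in for a vector register. The names vec, mask, add,
less_than, blend and add_one_where_negative are invented for this
example and do not come from any particular library. The operation and
the blend are written as two separate steps; on hardware with vector
masks, the assumption is that a compiler could fuse them into a single
masked instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 4;            // illustrative lane count
using vec  = std::array<float, N>;
using mask = std::array<bool, N>;

// Lane-wise addition, computed for every lane.
vec add(const vec& a, const vec& b) {
    vec r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = a[i] + b[i];
    return r;
}

// Lane-wise comparison against a scalar, producing a mask.
mask less_than(const vec& a, float t) {
    mask m{};
    for (std::size_t i = 0; i < N; ++i) m[i] = a[i] < t;
    return m;
}

// Per lane: take a where the mask is set, b otherwise.
vec blend(const mask& m, const vec& a, const vec& b) {
    vec r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = m[i] ? a[i] : b[i];
    return r;
}

// Vector form of "if (x < 0) x = x + 1;": compute the sum for all
// lanes, then keep it only in the lanes where the condition holds.
vec add_one_where_negative(const vec& x) {
    vec ones;
    ones.fill(1.0f);
    return blend(less_than(x, 0.0f), add(x, ones), x);
}
```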

The fact that it is a library doesn't mean that the compiler shouldn't
perform on vector types the same optimizations that it does on scalar ones.

While I can see the benefit of this feature for a compiler that wants to
generate SIMD for arbitrary code, dedicated SIMD code will not depend on
it so much that it cannot be covered by a couple of additional functions.

> Longer vectors can also dramatically change the generated code. It is
> *not* simply a matter of using larger strips for stripmined loops. One
> often will want to vectorize different loops in a nest based on the
> hardware's maximum vector length.

I don't see what the problem is here.
This is C++. You can write generic code for arbitrary vector lengths. It
is up to the user to use generative programming techniques to make his
code depend on this parameter and be portable. The library tries to make
this as easy as possible.
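
As a sketch of what "generic over the vector length" can look like, the
code below uses a hypothetical pack<T, N> type written just for this
example. It is not the Boost.SIMD interface, and its loops over lanes
only emulate what a real implementation would map to vector
instructions. The point is that saxpy is written once and instantiated
for whatever width suits the target.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-width vector type, for illustration only.
template <class T, std::size_t N>
struct pack {
    std::array<T, N> lane;

    static pack load(const T* p) {
        pack r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = p[i];
        return r;
    }
    void store(T* p) const {
        for (std::size_t i = 0; i < N; ++i) p[i] = lane[i];
    }
    friend pack operator+(const pack& a, const pack& b) {
        pack r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] + b.lane[i];
        return r;
    }
    friend pack operator*(const pack& a, const pack& b) {
        pack r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }
};

// y[i] = a * x[i] + y[i], parameterized on the vector width N.
template <std::size_t N>
void saxpy(float a, const float* x, float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    using P = pack<float, N>;
    P va;
    va.lane.fill(a);                      // broadcast the scalar
    std::size_t i = 0;
    for (; i + N <= n; i += N)            // full-width strips
        (va * P::load(x + i) + P::load(y + i)).store(y + i);
    for (; i < n; ++i)                    // scalar remainder
        y[i] = a * x[i] + y[i];
}
```

Calling saxpy<4> or saxpy<8> on the same data gives the same result;
only the strip width changes.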

> A library-based short vector model like the SIMD library is very
> non-portable from a performance perspective.

In my experience, it is still fairly reliable. There are differences
in performance, but they're mostly due to differences in how well the
hardware handles a particular application domain.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk