
Subject: Re: [boost] Going forward with Boost.SIMD
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2013-04-25 04:23:43

On 24/04/13 22:47, dag_at_[hidden] wrote:
> Mathias Gaunard <mathias.gaunard_at_[hidden]> writes:
>> Automatic parallelization will never beat code optimized by
>> experts. Experts program each type of parallelism by taking into
>> account its specificities.
> That is hyperbole. "Never" is a strong word.

A compiler can only perform the optimization that it has been engineered
to do.

A human can study the code and find the best optimizations available for
the algorithm at hand.

Until compilers become self-aware, they'll never be better than what a
human can do.

> Scalar predication hasn't changed the way people program because
> compilers do the if-conversion. As it should be with vectors.
> [...]
> I have trouble seeing how one would use the SIMD library to make it
> easier to write predicated vector code. Can you sketch it out?

As you said yourself, the if-conversion can be done by the compiler with
vectors just as easily as it can be done with scalars.

The library has an if_else(cond, a, b) function (similar to the ?:
ternary operator).

You cannot write

   if(cond)
   {
      x = foo;
      y = bar;
   }

but you can write

   x = if_else(cond, foo, x);
   y = if_else(cond, bar, y);

In the current implementation on MIC if_else is implemented as a
predicated move. The compiler could optimize this by fusing the
predicate with whatever operation is done to compute a or b.
On SSE4 it uses a blend instruction. On other SIMD architectures it uses
a combination of two or three bitwise instructions.

In the library itself, though not in the proposal, there are also a
couple of other functions where an operation is directly masked or
predicated, such as seladd and selsub, which perform predicated
addition/subtraction.

There is also a conditional store, because writing to memory is a
special thing.

> Predication allows much more effecient vectorization of many common
> idioms. A SIMD library without support for it will miss those idioms
> and the compiler auto-vectorizer will get better performance.

Few SIMD programming idioms would be missed. Yes, SIMD programming has
its own idioms. Interestingly enough, some of them are apparently not
always known by the people designing the hardware!

> So the user has to write multiple versions of loops nests, potentially
> one for each target architecture? I don't see the advantage of this
> approach.

C++ supports generic and generative programming.
You don't actually write multiple versions: you write a single generic
one, and the appropriate code is generated for each target.
As an example, there are also simple C++ utilities that you can use to
automatically unroll a loop by a given factor chosen at compile-time.
Not strictly SIMD-related though.

>> From my experience, it is still fairly reliable. There are differences
>> in performance, but they're mostly due to differences in the hardware
>> capabilities at solving a particular application domain well.
> Well yes, that's one of the main issues.

I don't see how it is an issue. Not all hardware has to be equal. Some
algorithms will also perform better on some types of hardware.

Consider GPUs, for example. The fact that they mostly don't have cache
means that the algorithms that you use for FFT or matrix multiplication
are entirely different than those used on a CPU. A compiler would have
no way of generating the optimal algorithm from the other, that's
something that must be done manually.

Likewise, if you write an algorithm that relies a lot on division
performance, it will be slower when moved to an architecture without
native division. You could use a different algorithm that avoids the
division, but choosing it is again something that must be done manually.
Boost list run by bdawes at, gregod at, cpdaniel at, john at