Subject: Re: [boost] Going forward with Boost.SIMD
From: dag_at_[hidden]
Date: 2013-04-24 17:00:34

>> All of the scalar and complex arithmetic using simple binary operators
>> can be easily vectorized if the compiler has knowledge about
>> dependencies. That is why I suggest standardizing keywords, attributes
>> and/or pragmas rather than a specific parallel model provided by a
>> library. The former is more general and gives the compiler more freedom
>> during code generation.
> It seems like the auto-parallelizing compiler is constantly just a
> couple of years away. I know there is progress, but apparently the
> complexity of today's architectures counteracts this.

Such compilers exist today. gcc and clang are not among them, but they
are improving.

Compilers exist in the field today that generate CPU/GPU code that
outperforms hand-coded CUDA. Compilers exist in the field today that
vectorize and parallelize code that outperforms hand-parallelized code.

There will always be cases where hand-tuning will win. The question is
whether it is a good idea to standardize a library that helps only those
cases, which fit a narrow model of parallelism.

Hand-tuned scalar code can beat compiler-generated code, yet we don't
advocate people write in asm all the time. Hand-written vector code is
really just a slightly higher form of asm. Even with operator
overloading the user still has to explicitly think about strip mining,
hardware capabilities and data arrangement.
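
To make the strip-mining point concrete, here is a rough sketch, not taken
from any particular library. The function names and the width kWidth are
made up for illustration, and both functions assume y is at least as long
as x. The first loop says nothing about vector width; the second bakes a
width into the source and needs a separate remainder loop.

```cpp
#include <cstddef>
#include <vector>

// Plain loop: no assumption about vector width. The compiler is free to
// vectorize, unroll, or parallelize it as it sees fit.
void saxpy_plain(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];
}

// Hypothetical hardware vector width, hard-coded by the programmer.
constexpr std::size_t kWidth = 4;

// Hand strip-mined loop: the iteration space is cut into strips of kWidth.
void saxpy_strip_mined(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size();
    const std::size_t full = n - n % kWidth;  // elements covered by whole strips

    for (std::size_t i = 0; i < full; i += kWidth) {
        // One strip. With a short-vector library this inner loop would be
        // a single load / multiply-add / store on a pack of kWidth floats.
        for (std::size_t j = 0; j < kWidth; ++j)
            y[i + j] = a * x[i + j] + y[i + j];
    }

    // Remainder elements that do not fill a whole strip.
    for (std::size_t i = full; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both compute the same result. The second one has to be revisited whenever
the target's vector width or preferred data layout changes.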

I would much rather see array syntax notation in standard C++ than a
library that provides one restricted form of parallelism.
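
For a feel of the whole-array style, std::valarray is the closest thing
standard C++ has today. It is a library facility, not the language-level
array notation I am asking for, so treat this only as an illustration of
how the source reads. The function name axpy is made up for the example.

```cpp
#include <cstddef>
#include <valarray>

// Whole-array expression: no explicit loop, no strip width, no index
// arithmetic. How the elements are traversed is left to the implementation.
std::valarray<float> axpy(float a, const std::valarray<float>& x,
                          const std::valarray<float>& y) {
    return a * x + y;
}
```

The user states what is computed over the arrays and nothing about how the
iterations are grouped or ordered.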

A SIMD library is fine, maybe even great! But not in the standard.

>> But see that's exactly the problem. Look at the X1. It has multiple
>> levels of parallelism. So does Intel MIC and GPUs. The compiler has to
>> balance multiple parallel models simultaneously. When you hard-code
>> vector loops you remove some of the compiler's freedom to transform
>> loops and improve parallelism.
> But isn't the current programming model broken? If you let the
> programmer write loops which the compiler will aim to parallelize,
> then the programmer will still always think of the iterations of
> running sequentially, thus creating an "impedance mismatch".

Or it's providing a level of abstraction convenient for the user.

> Programming models such as Intel's ispc or Nvidia's CUDA fare so well
> because they exhibit an acceptable amount of parallelism to the user,
> while simultaneously maintaining some leeway for the compiler.

As mentioned before, CUDA is on its way out for many codes. Yes, there
are models of parallelism that have proven useful. Co-Array Fortran is
one example. These models are generally implemented in languages in a
way that provides freedom to the compiler to optimize as it sees fit.
Putting too many constraints on implementation doesn't work well.

>> A library-based short vector model like the SIMD library is very
>> non-portable from a performance perspective. It is exactly for this
>> reason that things like OpenACC are rapidly replacing CUDA in production
>> codes. Libraries are great for a lot of things. General parallel code
>> generation is not one of them.
> CUDA is being rapidly replaced by things like OpenACC? Hmm, in my
> world people are still rubbing their eyes as they slowly realize that
> this "#pragma omp parallel for" gives them poor speedups, even on
> quad-core UMA nodes. And seeing how "well" the auto-magical offload
> mode on MIC works, they are very suspicious of things like OpenACC.

Knights Corner is not a particularly good implementation of the MIC
concept. That has been known for a while. It's a step on a path. I
referenced it for the ISA concepts, not the microarchitecture.

CUDA *is* being replaced by OpenACC in our customers' codes. Not
overnight, but every month we see more use of OpenACC.

OpenMP has some real deficiencies when it comes to efficient
parallelization. That is one reason I didn't mention it in my list of
suggestions. Still, it is quite useful for certain types of codes.

I'm not advocating for any particular parallel model. I'm advocating
for the tools to let the compiler choose the most appropriate model for
a given piece of code.
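
As a small sketch of the kind of dependency information I mean, many
compilers already accept a non-standard __restrict qualifier (GCC, Clang
and MSVC all do). It is an extension, not standard C++, and the function
name below is made up. It only shows the shape such a hint takes: the loop
stays an ordinary loop, and the programmer promises that the two pointers
do not alias.

```cpp
#include <cstddef>

// __restrict is a compiler extension, not standard C++. It tells the
// compiler that 'out' and 'in' never refer to overlapping memory, so
// iterations carry no dependence through these pointers.
void scale_add(float* __restrict out, const float* __restrict in,
               float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a * in[i] + out[i];
}
```

Nothing here fixes a vector width or a threading model. Whether the loop
ends up vectorized, unrolled, or left alone is the compiler's decision for
the target at hand.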

