Subject: Re: [boost] Going forward with Boost.SIMD
Date: 2013-04-24 14:00:40
Mathias Gaunard <mathias.gaunard_at_[hidden]> writes:
> The proposed SIMD library supports many architectures and has been
> deployed in several pieces of software, from academia to production
> software, with complex and varied usage patterns, and has given
> significant performance gains where optimizing compilers didn't give
> much even when loops were specifically written to be vectorizable.
> I wouldn't call it an inefficient model.
I said *relatively* inefficient. It's the best we have on commodity
processors right now, unfortunately. Really, investigate past vector
architectures. I would start with the Cray X1 or X2 because I am biased
and it's a pretty straightforward RISC-like vector ISA. It has a lot of
features implemented based on decades of vectorization and
parallelization experience.
I'm not knocking the SIMD library itself. I certainly see how it would
be a useful bridge between current and future architectures. I just
don't think we should standardize something that's going to rapidly
become obsolete.
All of the scalar and complex arithmetic using simple binary operators
can be easily vectorized if the compiler has knowledge about
dependencies. That is why I suggest standardizing keywords, attributes
and/or pragmas rather than a specific parallel model provided by a
library. The former is more general and gives the compiler more freedom
during code generation.
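To make that concrete, the dependence information I mean can already be
conveyed with something like OpenMP 4.0's `#pragma omp simd` together
with `restrict`-style aliasing guarantees. A minimal sketch (the
function name is mine, and `__restrict` is a common compiler extension
rather than standard C++):

```cpp
#include <cstddef>

// The pragma asserts that iterations are independent; the compiler is
// then free to choose the vector length, unrolling factor, and
// instruction selection for whatever target it compiles for.
void saxpy(float a, const float* __restrict x,
           float* __restrict y, std::size_t n)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Note that without OpenMP enabled the pragma is simply ignored and the
code remains correct scalar code, which is exactly the graceful
degradation a keyword/pragma approach buys you over a library type.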
For specialized operations like horizontal add, saturating arithmetic,
etc., we will need intrinsics or functions that will necessarily be
target-specific.
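For example, a horizontal add is inherently a function-call-shaped
operation, not an operator. A portable scalar model (illustration only;
the name `hadd4` is mine):

```cpp
// Horizontal (reduction) add over one register's worth of floats.
// On SSE3 this pattern maps onto intrinsics such as _mm_hadd_ps; on
// other targets it lowers to something entirely different, which is
// why it belongs behind a named function rather than an operator.
float hadd4(const float v[4])
{
    return (v[0] + v[1]) + (v[2] + v[3]);
}
```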
> It doesn't aim to do all sorts of parallelization, just the SIMD
> part. Other parallelization and optimization tasks must be done in
> addition to its usage.
But see, that's exactly the problem. Look at the X1. It has multiple
levels of parallelism. So does Intel MIC and GPUs. The compiler has to
balance multiple parallel models simultaneously. When you hard-code
vector loops you remove some of the compiler's freedom to transform
loops and improve parallelism.
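Here's the sort of thing I mean, written with plain arrays rather than
any particular library type (illustration only; assumes `cols` is a
multiple of 4, remainder handling elided):

```cpp
#include <cstddef>

// Once the inner loop is written in explicit 4-wide strips -- which is
// what a short-vector library bakes into the source -- the compiler
// can no longer interchange the i/j loops, fuse them, or re-vectorize
// at a different width without first undoing the user's hand-coding.
void scale_rows_fixed(float* a, std::size_t rows, std::size_t cols,
                      float s)
{
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j + 4 <= cols; j += 4) { // VL = 4, hard-coded
            a[i * cols + j + 0] *= s;
            a[i * cols + j + 1] *= s;
            a[i * cols + j + 2] *= s;
            a[i * cols + j + 3] *= s;
        }
}
```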
>> See Intel MIC. This stuff is coming much faster than most people
>> realize. From where I sit (developing compilers professionally for
>> vector architectures), the path is clear and it is not the current
>> SSE/AVX model.
> I wouldn't say that MIC is that different from SSE/AVX.
> Scatter, predication, conversion on load/store. That's just extras, it
> doesn't fundamentally change the model at all.
Vector masks fundamentally change the model. They drastically affect
how the compiler vectorizes conditional code.
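To make the mask point concrete, here is a scalar model (illustration
only, tied to no particular ISA) of what predicated hardware lets the
compiler do with a conditional loop body:

```cpp
#include <cstddef>

// If-conversion: compute a per-element predicate and select, instead
// of branching. On masked hardware every element of the vector runs
// this body; the mask controls which results are kept. No control
// flow means the whole loop vectorizes cleanly.
void clamp_negatives(float* x, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        bool mask = x[i] < 0.0f;    // per-element predicate (the "mask")
        x[i] = mask ? 0.0f : x[i];  // masked select, branch-free
    }
}
```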
Longer vectors can also dramatically change the generated code. It is
*not* simply a matter of using larger strips for stripmined loops. One
often will want to vectorize different loops in a nest based on the
hardware's maximum vector length.
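For reference, stripmining looks like this when the strip width stays a
parameter rather than being baked into the body (a sketch; the names
are mine):

```cpp
#include <algorithm>
#include <cstddef>

// Stripmined loop: split into strips of at most vl elements, where vl
// comes from the target's maximum vector length -- a per-machine (and
// on classic vector hardware, potentially run-time) quantity. The
// point is that the right vl, and even *which* loop in a nest to
// stripmine, depends on the hardware, so it should not be frozen into
// the source.
void scale(float* x, std::size_t n, float s, std::size_t vl)
{
    for (std::size_t i = 0; i < n; i += vl) {
        std::size_t strip = std::min(vl, n - i);
        for (std::size_t j = 0; j < strip; ++j)  // the "vector" body
            x[i + j] *= s;
    }
}
```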
A library-based short vector model like the SIMD library is very
non-portable from a performance perspective. It is exactly for this
reason that things like OpenACC are rapidly replacing CUDA in production
codes. Libraries are great for a lot of things. General parallel code
generation is not one of them.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk