
Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-06-10 19:56:13


[ Note: I had written this before I saw Joel's answer, so there might be
a bit of duplication here. Sorry about this. ]

On 10/06/2011 22:16, David A. Greene wrote:

> Almost everything the compiler needs to vectorize well that it does not
> get from most language syntax can be summed up by two concepts: aliasing
> and alignment.

That's not the problem as we see it.

Our argument is that vectorization needs to be explicitly programmed by
a human, because otherwise the compiler may not be able to entirely
rethink your algorithm to make it vectorizable.
Designing for vectorization can require changing your algorithm
completely, and you might get completely different results due to
running with lower precision, running operations in a different order
(remember that floating-point arithmetic is not associative), etc.,
making it a destructive conversion.

If the programmer is not aware of how to program a SIMD unit, he might
write code that is not vectorizable without realizing it.
By making it explicit, we can guarantee to that programmer that he gets
what he asked for, rather than depending on the whims of an optimizer.
Even in Boost.SIMD, some users claim we depend on the optimizer too much.

Also, the compiler seems to be unable to perform this automatic
vectorization except in the most trivial of cases.

The benchmarks show this. Just in the slides, there is a 4x improvement
when using std::accumulate (which is still a pretty trivial algorithm)
with pack<float> instead of float, and in both cases automatic
vectorization was enabled.

Actually, I think you even get a 6x improvement if you unroll the loop
within accumulate (due to pipelining effects), but I'm not sure of this.
Joel, you did that test, care to comment?

>
> I don't see how pack<> addresses the aliasing problem

It doesn't aim to.

A pack abstracts a SIMD register.
Aliasing is a problem tied to pointers, which are a separate thing.

Here is a non-exhaustive list of concerns to keep in mind when writing
high-performance code on a single CPU core with a SIMD unit:
  - SIMD instructions used
  - memory allocation, alignment, and padding
  - loop unrolling for pipelining effects
  - cache-friendliness of memory accesses (tied to alignment as well --
also important for when you go multicore)

The choice of Boost.SIMD is to separate all of those concerns.
pack<T> takes care of formalizing the register and generating the best
set of instructions available on the target CPU for what you ask.

You're still responsible for the rest. We try to provide tools to help
you in those tasks, but you don't have to use them if you don't want to.
That's called low coupling.

Actually, the NT2 library aims at providing a unified solution for the
entire thing and more (multicore, gpu, distributed...)
But Boost.SIMD is just the SIMD component, decoupled from the rest.

> in any way that is
> not similar to simply grabbing local copies of global data or
> parameters. Various C++ "restrict" extensions already address the
> latter. We desperately need something much better than "restrict" in
> standard C++.

Yes, you should use the restrict keyword (which is available in most C++
compilers in one fashion or another) to tell the compiler that your
pointers do not alias.

> Manycore is the future and parallel processing is the new
> normal.

I don't see the direct link.
Compilers can more easily tell that they can automatically parallelize a
loop when it uses restrict pointers, but if you're explicitly
parallelizing that loop (which is our approach -- explicit description
of the parallelism but automatic generation of the associated code),
then it's not strictly required.

> pack<> does address alignment, but it's overkill.

It doesn't address it, it requires it.
That's entirely the opposite, since that means that it's up to the
caller, not the callee, to address it.

> It's also
> pessimistic. One does not always need aligned data to vectorize, so the
> conditions placed on pack<> are too restrictive.

Loads and stores from an arbitrary pointer can be much slower, are not
portable, and are just a bad idea. This is not how you're meant to use a
SIMD unit, or even an ALU or an FPU.

Also, there is no compelling argument at all for using them, other than
automatic vectorization from arbitrary code, which is not what this
library is about.

It's just like reading an int from a pointer not aligned on a 4 byte
boundary (assuming std::alignment_of<int>::value is 4). While that's
allowed on x86 (at a certain runtime cost), it is not on a lot of other
architectures, and it's not allowed by the C++ standard either.

> Furthermore, the
> alignment information pack<> does convey will likely get lost in the
> depths of the compiler, leading to suboptimal code generation unless
> that alignment information is available elsewhere (and it often is).

It does not convey any alignment information. It's an abstraction for a
register. A register does not have an address, so any concept of
alignment makes no sense.
In truth, it may not be a register, since we allow the compiler to
transparently fall back to the stack, so that we don't have to go down
to the register allocation level, but when it does it (usually because
you've used more than 16 interdependent variables), it does it correctly.

pack generates the right instructions through use of intrinsics, and I
don't see what you mean about suboptimal code.

With SSE, there is one intrinsic for aligned load, and another for
unaligned load. We always use the aligned one.

Other intrinsics are obviously not affected by this, since alignment
does not matter for assembly instructions that operate on registers...

> I think a far more useful design of this library would be providing
> standard ways to assert certain conditions. For example:
>
> simd::assert(simd::is_aligned(&v[0], 16))

We already have this, though it's spelled is_aligned<16>(&v[0]).
is_aligned(&v[0]) also works; it checks against the strongest alignment
required by the largest SIMD unit (i.e. 32 on SandyBridge/Bulldozer,
and possibly 64 on future Intel processors).

> Provide simple things the compiler can recognize via pattern matching
> and we'll be a long way to getting the compiler to autovectorize.

That would be fine if that was what we wanted, and if we were working on
extending C++ compilers, but that's not the case at all.

We're not compiler writers. We just want a tool to express vectorization
in a way that is manageable, portable and future-proof, and that can
guarantee us that we'll use the CPU capabilities to the max.
Joel and I work in HPC research, and we need that performance. We can't
just hope that compilers will automagically do it for us.

You might as well write a naive matrix multiplication and expect the
compiler to generate code with performance on par with that of an
optimized BLAS routine.
That's wishful thinking, albeit it surely would be nice (and I know
people who are working on this kind of thing -- in my research team, even).

> I like simd::allocator to provide certain guarantees to memory managed
> by containers. That plus some of the asserts described above could help
> generic code a lot.

Bear in mind there is a lot of stuff undocumented because the library is
still being boostified and documentation being written.

> Other questions about the library:
>
> What's under the operators on pack<>? Is it assembly code?

It's intrinsic-level code. As shown in the slides, the type under the
hood of pack is the same type you use when you write such intrinsics.

We prefer intrinsics to assembly because they are more readable, more
portable and more optimization-friendly.

> I wonder how pack<T> can know the best vector length. That is highly,
> highly code- and implementation-dependent.

You tell it, one way or another, the SIMD ISAs that are available on
your architecture.

Then when you do pack<T, N>, it selects out of all of these the ones
that support vectors of N x T. (It does not support the case where
multiple matches are available.)

If you do pack<T>, it selects out of all of these the ones that support
vectors of T, and takes the one with the biggest size.

Of course, that means that each new instruction set requires an addition
to the library.
But we use a fairly fancy dispatching system (based on overloading, ADL,
partial specialization and decltype) which lets us write things
generically, externally and in a nicely extensible way.

> How does simd::where define pack<> elements of the result where the
> condition is false? Often the best solution is to leave them undefined
> but your example seems to require maintaining current values.

It's never undefined, since there is no lazy evaluation.
This is clearly explained in the slides.

if_else (or where or select, it has various names) is implemented in
terms of bitwise operations.

> How portable is Boost.simd? By portable I mean, how easy is it to move
> the code from one machine to another get the same level of performance?

Here is what we currently support and what we aim to support:

Operating systems:
  - Linux
  - Mac OS X
  - Windows
  - AIX (in development)

Compilers:
  - GCC
  - Clang
  - MSVC10+ (due to the boost.typeof trick not working for us)

Processors:
  - x86 (SSE2, SSE3, SSSE3, SSE4a, SSE4.1, SSE4.2 -- AVX, XOP and FMA4
were also working at some point, but I believe they've been broken by
some recent changes)

Less stable:
  - PowerPC (AltiVec, Cell SPU, VMX128 (Xbox360) and VSX (POWER7), the
latter two under way)
  - ARM (NEON in development)

Reference platform is GCC Linux x86-64 with SSE4.
There are still a couple of tests failing on MSVC/Windows, but we're
working on it.

Performance can be so-so with MSVC when aggressive inlining is disabled
(due to calling-convention and aliasing issues), but we're also working
on that.

> I don't mean to be too discouraging. But a library to do this kind of
> stuff seems archaic to me. It was archaic when Intel introduced MMX.
> If possible, I would like to see this evolve into a library to convey
> information to the compiler.

Since you don't seem to have particular real world experience with SIMD,
you don't really make a case for your judgment.

Experience has shown us that this kind of thing is necessary, and a lot
of tools, including some from Intel, AMD or Apple exist just to help you
write SIMD code.
Most if not all high-performance numerical computation libraries also
write SSE instructions directly and don't rely on the compiler to do it.
(BLAS, LAPACK, FFTW to name a few)

Also, if you want a vectorized sine function, you have to write it.
Compilers don't have any, and if they did, they'd just use a library for it.

Our implementations of trigonometric functions based on Boost.SIMD are,
for example, faster (and more precise) than the ones provided with MKL,
the math kernel library by Intel for optimized math on their processors.
I believe that's in part because Boost.SIMD has allowed us to write them
easily; and they are portable and automatically scale to new
architectures and new vector sizes.
We also implement most of the C99 and TR1 functions and more, in an
IEEE754-compliant way, with very high precision, while they only provide
a handful.

We have many use cases where it has proven to allow us to do better
things in our research activity, and it's also at the core of the
technology of a start-up that's been acknowledged to be a disruptive
innovation by the local "competitiveness cluster".

So I can't really agree that the library is archaic on the basis of such
a shallow analysis.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk