Subject: Re: [boost] [OT?] SIMD and Auto-Vectorization (was Re: How to structurate libraries ?)
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2009-01-20 02:25:20
On Tue, Jan 20, 2009 at 9:01 AM, Patrick Mihelich wrote:
> I'm not sure how or why this turned into a discussion of the general
> concurrency problem in C++. This is interesting, certainly, but should
> probably be considered a separate topic from SIMD and auto-vectorization. It
> doesn't seem very fair to me to criticize a SIMD library for fighting one
> battle instead of winning the whole war; you could make similar criticisms
> of Boost.Thread or Boost.MPI, yet these are useful libraries.
I'm not about being fair: I am about tackling the issue from a larger
perspective.
I am not criticizing a SIMD library in the sense that I don't think
other people will want to use it -- I personally think that being
wrapped in a DSEL makes it clever, but nonetheless the scope of the
problem is too narrowly defined. That means, *I* don't think putting
these groups of operations together needs to be that complicated.
> The sense I'm getting from this discussion is that SIMD code generation is
> uninteresting, and that we should stick our heads in the sand and wait for
> the Sufficiently Smart Compilers to come along. OK, I sympathize with this.
> Writing functions using SIMD intrinsics is a bit of a distraction from the
> computer vision tasks I actually care about, but I have time budgets to meet
> and usually some work of this type has to be done.
Actually, I think you're missing the point (at least from what I'm saying).
I'm saying SIMD code generation ought to be the job of the compiler(s)
for the platforms where they make sense. Now *if* you wanted to be
able to specifically make it work, you can do something that others
have already been doing: adding a layer of indirection.
Now this layer of indirection can be as clever as a DSEL (which I
don't think it needs to be) or as simple as a function that switches
implementations at compile time using preprocessor macros or some
other facility. Now if you needed to optimize a set of operations that
are specific to your field (like for example, applying a blur on a set
of pixels represented by a set of floats) then I wouldn't find it hard
to imagine having that specific part hand-optimized for your need.
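A minimal sketch of the "simple" end of that spectrum: one function whose body is chosen at compile time by the preprocessor. The name scale_pixels and the choice of operation (a gain applied to a buffer of floats) are hypothetical, picked only for illustration; the SSE intrinsics are the standard ones from <xmmintrin.h>.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics, only pulled in where available
#endif

// Hypothetical helper: multiply every float in a buffer by `gain`.
// Callers see one function; the preprocessor picks the implementation.
inline void scale_pixels(float* data, std::size_t n, float gain) {
    assert(data != nullptr || n == 0);
    std::size_t i = 0;
#if defined(__SSE__)
    const __m128 g = _mm_set1_ps(gain);          // broadcast gain to 4 lanes
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);       // unaligned load of 4 floats
        _mm_storeu_ps(data + i, _mm_mul_ps(v, g));
    }
#endif
    // Scalar path: the whole buffer without SSE, or the leftover tail with it.
    for (; i < n; ++i) data[i] *= gain;
}
```

Code that calls scale_pixels() does not change when another platform-specific branch (say, an Altivec one) is added later.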
Does this need another library? I wager to say it doesn't -- it's like
saying you're implementing a DSEL in C++ to do simple mathematics. Now
if you want to write your own DSEL for image manipulation and perform
the transformation in the background to use the SIMD instructions for
a specific platform, then fine, that would be great. The details of
the implementation would be just that: details. I don't see a need for
a special library just for SIMD instructions, *especially* since the
compilers will be able to automatically vectorize the parts that can
easily be vectorized. (I use the term "easily" here very loosely
because that depends on the compiler you're using.)
> IMO, waiting for compiler technology is neither pragmatic in the short-term
> nor (as I argued in the other thread) conceptually correct. If you look at
> expression-template based linear algebra libraries like uBlas and Eigen2,
> these are basically code generation libraries (if compilers were capable of
> complicated loop fusion optimizations, we might not need such libraries at
> all). Given an expression involving vectors, it's fairly mechanical to
> transform it directly into optimal assembly. Whereas at the level of
> optimizing IR code, reasoning about pointers and loops is rather
> complicated. Are the pointers 16-byte aligned? Does it make sense to
> partially unroll the loop to exploit data parallelism? What transformations
> can (and should) we make when traversing a matrix in a double loop? There
> are all sorts of obstacles the compiler must overcome to vectorize code. Why
> not handle these issues in the stage of compilation with the best
> information and clearest idea of what the final assembly code should look
> like - in this case at the library level with meta-programming?
I am not against it -- now if you're talking about fixing uBlas to
make it aware of the capabilities of a platform and perform the
transformations necessary to be able to leverage vendor-specific
libraries, then I'm all for it. Do I think it needs a special
DSEL/library for doing so? *That* is what I'm questioning.
The reason I like the thought of letting the compiler do the
auto-vectorization for me is that the compiler already knows about my
code and the transformations it's going to do to make it work --
there's no reason for a compiler not to be able to know these details
you talk about. It's not even absurd for a compiler to turn certain
code patterns into OpenMP to parallelize parts of the solution and
then, at even lower levels, generate the SIMD code that leverages the
SIMD extensions of the platform it's targeting.
> The fact of the matter is that compilers do not generate optimal
> SIMD-accelerated code except in the simplest of cases, and so we end up
> using SIMD intrinsics by hand. Frankly I don't expect this to change
> dramatically anytime soon; I'm no compiler expert, but my impression is that
> some complicated algebraic optimizations (for which C++ is not very suited)
> are necessary.
And I don't question the fact that compilers do not yet generate
optimal SIMD-accelerated code -- especially if you're talking about
GCC. I might be surprised to hear the same about Intel's compiler but
I know that it does perform quite advanced optimizations on the code
to leverage SSE on the platforms it supports.
If it's a matter of arranging your C++ so that it can be automatically
vectorized by a less sophisticated yet auto-vectorizing compiler, then
I would think that would be a more achievable goal (and easier?) to
accomplish than releasing/maintaining a SIMD-only DSEL/library.
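As a rough illustration of what "arranging your C++" can mean, here is a loop written so that an auto-vectorizer has little to prove: contiguous arrays, a simple counted loop, no function calls in the body, and no possible aliasing between input and output. The function name saxpy is only an illustrative choice, and __restrict is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++. Whether a given compiler actually vectorizes this depends on the compiler and flags (GCC, for example, enables its vectorizer at -O3).

```cpp
#include <cassert>
#include <cstddef>

// out[i] = a * x[i] + y[i] over n elements.
// __restrict promises the compiler that the three arrays do not overlap,
// which removes one common obstacle to vectorization.
inline void saxpy(float* __restrict out, const float* __restrict x,
                  const float* __restrict y, float a, std::size_t n) {
    assert(n == 0 || (out != nullptr && x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];   // independent iterations, unit stride
    }
}
```

Nothing here is platform-specific, so the same source can be handed to any compiler, vectorizing or not.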
> Using SIMD intrinsics by hand is nasty in assorted ways. The syntax is not
> standard across compilers. The instructions are not generic; if I change
> datatypes from double to float, or int to short, I have to completely
> rewrite the function.
What stops you from adding a function that specializes on the types on
top of these SIMD-specific functions/vectors?
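One way to read that suggestion, as a sketch only: a small traits class specialized per element type, with a generic function written once on top of it. The names simd_traits and add_arrays are hypothetical and do not come from any existing library; the SSE/SSE2 intrinsics are the standard ones from <emmintrin.h>, and a 1-wide scalar fallback is assumed for other platforms.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 (also provides the SSE float intrinsics)

template <typename T> struct simd_traits;  // hypothetical; only float/double

template <> struct simd_traits<float> {
    typedef __m128 reg;
    static constexpr std::size_t width = 4;
    static reg load(const float* p) { return _mm_loadu_ps(p); }
    static void store(float* p, reg r) { _mm_storeu_ps(p, r); }
    static reg add(reg a, reg b) { return _mm_add_ps(a, b); }
};

template <> struct simd_traits<double> {
    typedef __m128d reg;
    static constexpr std::size_t width = 2;
    static reg load(const double* p) { return _mm_loadu_pd(p); }
    static void store(double* p, reg r) { _mm_storeu_pd(p, r); }
    static reg add(reg a, reg b) { return _mm_add_pd(a, b); }
};
#else
// Fallback for other platforms: a "vector" of one element.
template <typename T> struct simd_traits {
    typedef T reg;
    static constexpr std::size_t width = 1;
    static reg load(const T* p) { return *p; }
    static void store(T* p, reg r) { *p = r; }
    static reg add(reg a, reg b) { return a + b; }
};
#endif

// Written once; changing float to double does not require a rewrite.
template <typename T>
void add_arrays(const T* a, const T* b, T* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    typedef simd_traits<T> tr;
    std::size_t i = 0;
    for (; i + tr::width <= n; i += tr::width)
        tr::store(out + i, tr::add(tr::load(a + i), tr::load(b + i)));
    for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover tail
}
```

On the SSE2 branch, instantiating add_arrays with a type that has no specialization fails at compile time, which is one possible design choice; a scalar default would be another.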
> Maybe someone wants to run my SSE-enabled code on an
> Altivec processor, what then? There may be counterparts to all the
> instructions, but I don't know Altivec. Even different versions of the same
> instruction set are a problem; do I really want to think about the benefits
> of SSSE3 vs. just SSE2 and write different versions of the same function for
> the various instruction sets?
Which is why I'd rather rely on the compiler to do it for me -- if the
compiler for the Altivec processor doesn't know how to auto-vectorize
my code, then tough luck: I'm going to use SIMD-specific
functions/vectors that are specific to that platform anyway. If I
already had that SIMD-enabled min() function that broke up a large
vector into smaller SIMD vectors that did a faster min() than a
non-SIMD version of min(), then I can specialize on the Altivec
Heck, if I was writing an image manipulation library for different
platforms, I might chunk up the operations even at a higher level and
leverage the SIMD-izable parts at that level specific to the image
But again, I don't see a need for that library that is specific to
SIMD-enabling operations to be done on values. But then again, that's
> Just having a nice, generic wrapper around SIMD ops would be a big step to
> ease the writing of SIMD-enabled code and make it accessible to a wider
> community. I think the Proto-based approach holds promise for taking
> advantage of various instruction sets without the user having to think much
> about it.
Here lies the problem: SIMD operations are specific yet generic enough
to be used on their own already. What you want to be able to deal with
is the vendor-provided interface to the SIMD vectors and SIMD
operations they already support. More specifically, GCC's support for
SSE/SSE2/SSE3/MMX/... registers/vectors and operations [1], and if
you're lucky enough to get your hands on Intel compilers, you can also
read about how to lay out your code for the compiler to be able to
automatically vectorize it for you [2].
[1] http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Vector-Extensions.html#Vector-Extensions
[2] http://www.aartbik.com/SSE/index.html
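For a sense of what the GCC vector extensions linked above look like in use, here is a minimal sketch. It assumes GCC or a compatible compiler such as Clang, since the vector_size attribute is not standard C++. The typedef name v4sf follows the GCC documentation's convention; the helper madd4 is hypothetical.

```cpp
#include <cassert>
#include <cstring>

// GCC/Clang extension: a 16-byte vector holding four floats.
typedef float v4sf __attribute__((vector_size(16)));

// Hypothetical helper: out[i] = a[i] * b[i] + c[i] for four floats.
// Arithmetic operators apply element-wise to the vector type; the compiler
// maps them to whatever SIMD instructions the target offers (or to scalar
// code if it offers none).
inline void madd4(const float* a, const float* b, const float* c, float* out) {
    assert(a != nullptr && b != nullptr && c != nullptr && out != nullptr);
    v4sf va, vb, vc;
    std::memcpy(&va, a, sizeof va);  // memcpy avoids alignment assumptions
    std::memcpy(&vb, b, sizeof vb);
    std::memcpy(&vc, c, sizeof vc);
    v4sf r = va * vb + vc;
    std::memcpy(out, &r, sizeof r);
}
```

The same source is accepted for SSE and Altivec targets alike, which is the sense in which the vendor-provided interface is already generic.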
-- Dean Michael C. Berris Software Engineer, Friendster, Inc.