Subject: Re: [boost] [OT?] SIMD and Auto-Vectorization (was Re: How to structurate libraries ?)
From: Joel Falcou (joel.falcou_at_[hidden])
Date: 2009-01-18 03:16:58


Dean Michael Berris wrote:
> What do you mean by parallelization for CotS (Commodity off-the Shelf)
> machines?
I took the link to cray.com as an answer to my parallelization
question. What I mean is that not everyone has to deal with large
mainframes; many work on Beowulf-like clusters or even simple multi-core
machines, on which running vendor-specific runtime middleware may or may
not be sensible. Moreover, not everyone who needs parallelism wants
to do HPC. Computer vision and multimedia applications are also highly
demanding, sometimes even more so, as they are also bound by
real-time or interactive-time constraints under which raw GFLOPS is not
what you seek.

Anyway, consider my answer on this subject a miscommunication.

> If you mean dealing with high level parallelism across machines, you
> already have Boost.MPI and I suspect you don't intend to replace that
> or tackle that kind of parallelism at the large scale.
>
MPI is by no means what I would call high-level. It is only a
message-passing assembly language with a few interesting abstractions.
BSP-based tools or algorithmic-skeleton-based tools are the real
high-level tools for inter-machine parallelization. And, well, I have
already replaced MPI with something at a higher abstraction level for at
least three architecture styles (cluster, multi-core, Cell) and
published about it (see [1] for reference), and I plan to present it at
BoostCon this year to show exactly how Boost meta-programming tools made
this kind of tool possible and usable.
> If you mean dealing with parallelism within a program, then you
> already have Boost.Thread and Boost.Asio (with the very nice
> io_service object to be able to use as a job-pool of sorts run on many
> threads).
>
Then again: low-level. I've dealt with a large variety of people,
ranging from physicists to computer vision experts. They are all
at a loss when they have to take their Matlab, C or FORTRAN legacy
code to C++ with threads. Some of them don't even KNOW their machine has
this kind of feature, or don't know about simple things like
scaling or the Gustafson-Barsis and Amdahl laws, and think that since
they have 3 machines with 4 cores each, their code will magically go 12
times faster no matter what. I've been a parallel software designer in
such a laboratory, and I can tell you that most experts in field X have
no idea how to tackle MPI, Asio or even simple threads. Worse, sometimes
they don't want to, because they find it uninteresting. OK, you might
say that they can just send their code to someone else to parallelize.
Alas, most of the time they don't want to, as they won't be able to be
sure you didn't butcher their initial code (following the good ol' Not
Invented Here principle). At this point, you *have* to give them tools
they understand, want to use and are accustomed to, so they can do the
porting themselves. And this requires providing either a really
high-level or a domain-specific interface to those levels of
parallelism, which includes, among other things, using things like
threads or Asio while hiding them very, very deeply. Then again, that's
my own personal experience. If you have some secret management tricks to
get parallelism-unaware people to use low-level tools, then I'm all ears
:) because I have to do this on a daily basis.
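
To make the "12 cores does not mean 12 times faster" point concrete,
here is a minimal sketch of the two laws mentioned above. The function
names are made up for this example, and the 90% parallel fraction used
below is an assumption for illustration, not a measurement.

```cpp
#include <cassert>

// Amdahl's law: speed-up of a fixed-size problem on n processing units
// when a fraction p (0 <= p <= 1) of the run time can be parallelized.
double amdahl_speedup(double p, unsigned n)
{
    assert(p >= 0.0 && p <= 1.0 && n > 0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Gustafson-Barsis law: scaled speed-up when the problem size grows with
// n; here p is the parallel fraction of the run time as measured on the
// parallel machine.
double gustafson_speedup(double p, unsigned n)
{
    assert(p >= 0.0 && p <= 1.0 && n > 0);
    return (1.0 - p) + p * n;
}
```

With p = 0.9 and n = 12, amdahl_speedup gives about 5.7 and
gustafson_speedup gives 10.9; only a fully parallel code (p = 1)
reaches 12.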
> At a lower level, you then have the parallel algorithm extensions to
> the STL already being shipped by compiler vendors (free and
> commercial/proprietary) that have parallel versions of the STL
> algorithms (see GNU libstdc++ that comes with GCC 4.3.x and
> Microsoft's PPL).
>
Same remark as before. There are people out there who don't even know
that the STL (and sometimes proper C++) exists or, even worse, don't
want to use it because, you never know, it might be buggy. And sadly,
these aren't jokes :/
> actually, a vector supercomputer is basically a SIMD machine at a
> higher level -- that's if I'm understanding my related literature
> correctly.
What I meant to say is that they have a fundamentally different
low-level API, and so auto-vectorization at the machine level may
differ from intra-processor SIMD code generation. Then again, this
remark may be due to me misreading the previous post.

> You might be surprised but the GNU Compiler Collection already has
> -ftree-vectorize compiler flag that will do analysis on the IR
> (Intermediate Representation) of your code and generate the
> appropriate SSE/SSE2/SSE3/MMX/... for your architecture. They advise
> adding -msse -msse2 for x86 platforms to the flags.
Yes, I am aware, and all my experiments show that it produces correct
code for trivial software but fails to vectorize larger pieces of code
and dies in shame as soon as you need shuffles, type casting and other
similar features.
> Although I don't think you should stop with the research for better
> ways to do automatic vectorization, I do however think that doing so
> at the IR level at the compiler would be more fruitful especially when
> combined with something like static analysis and code transformation.
>
Except this means a new compiler version, which means that end users
either have to wait for those algorithms to land in mainstream GCC or
whatever compiler distribution they use, OR have to use some
experimental version. In the former case, they don't want to wait. In
the latter case, they don't want to have to install such a thing. The
library approach is a way to get things that work out fast, based only
on existing compilers. Nothing prevents this library from evolving as
compilers do.

On code transformation, that's exactly what a DSEL does: building new
'source' from a high-level specification and writing "compiler
extensions" from the library side of the world. I've been doing this
since POOMA and the advent of Blitz++ and always thought they were
spot-on for things like parallelism. When Proto was first announced and
became usable, it was a real advance, as building a DSL started to look
like building a new language in the good ol' way.
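
As a rough illustration of the idea, here is a hand-rolled expression
template in the spirit of Blitz++. It is only a sketch: it does not use
Boost.Proto, and the names farray and add_expr are invented for this
example. The point is that operator+ merely records the operation, and
the assignment turns the whole expression into a single loop.

```cpp
#include <cstddef>
#include <vector>

// Node of the expression tree: represents l + r without computing it.
template <class L, class R>
struct add_expr
{
    const L& l;
    const R& r;
    float operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

// Hypothetical array type, only for this sketch.
struct farray
{
    std::vector<float> data;
    explicit farray(std::size_t n, float v = 0.f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression evaluates it in one loop, with no
    // temporary arrays. Operands are assumed to have the same size.
    template <class L, class R>
    farray& operator=(const add_expr<L, R>& e)
    {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ only builds tree nodes; nothing is computed here.
inline add_expr<farray, farray> operator+(const farray& a, const farray& b)
{
    return {a, b};
}

template <class L, class R>
add_expr<add_expr<L, R>, farray> operator+(const add_expr<L, R>& a, const farray& b)
{
    return {a, b};
}
```

Writing r = a + b + c; with three farray operands then runs one loop
computing a[i] + b[i] + c[i] per element. A real DSEL would replace that
loop body with whatever the target needs (SIMD, threads, and so on).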

> You might even be surprised to know that OpenCL (the new framework for
> dealing with heterogeneous scalable parallelism) even allows for
> run-time transformation of code that's meant to work with that
> framework -- and Apple's soon-to-be-released new version of its
> operating system and compiler/SDK will already support it.
>
Oh, I await OpenCL fondly, so I'll have a new low-level tool from which
to generate high-level libraries ;)
> Because GCC needs help with auto-vectorization, and that GCC is a
> best-effort project (much like how Boost is). If you really want to
> see great auto-vectorization numbers, maybe you can try looking at the
> Intel compilers? Although I haven't personally gotten (published and
> peer-reviewed) empirical studies to support my claim, just adding
> auto-vectorization in the compilation of the source code of the
> project I'm dealing with (pure C++, no hand-written SIMD stuff
> necessary) I already get a significant improvement in performance
> *and* scalability (vertically of course).
I'm aware of this. GCC auto-vectorization is, for me, still in its
infancy. Take a simple dot product. The auto-vectorized version shows a
speed-up of 2.5 for floating-point values. The handmade version goes up
to 3.8. Moreover, only statically analyzable loop nests get vectorized
(I speak of what can be done nowadays with 4.3 and 4.4, not what's
announced in various papers). With icc, the scope of vectorizable code
is larger and includes sensible parts. The code quality is also
superior. Except that not all platforms are Intel platforms.
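
For readers who want to see what the two variants look like, here is a
minimal sketch, assuming an x86 target with SSE. It is not the code
behind the figures above, and the function names are made up. The first
function is the plain loop an auto-vectorizer has to recognize; the
second is a handmade version whose final reduction needs the kind of
shuffle mentioned earlier.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Plain loop: the form an auto-vectorizer has to recognize on its own.
float dot_scalar(const float* a, const float* b, std::size_t n)
{
    float sum = 0.f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Handmade SSE version: 4 floats per iteration, then a horizontal sum.
// Falls back to the scalar loop when SSE is not available.
float dot_sse(const float* a, const float* b, std::size_t n)
{
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));

    // Horizontal sum of the 4 lanes of acc.
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1)); // swap lanes pairwise
    __m128 sums = _mm_add_ps(acc, shuf);   // lane 0 = acc0+acc1, lane 2 = acc2+acc3
    shuf = _mm_movehl_ps(shuf, sums);      // bring lane 2 of sums down to lane 0
    sums = _mm_add_ss(sums, shuf);         // lane 0 = acc0+acc1+acc2+acc3
    float sum = _mm_cvtss_f32(sums);

    for (; i < n; ++i) sum += a[i] * b[i]; // leftover elements
    return sum;
#else
    return dot_scalar(a, b, n);
#endif
}
```

Note that the SSE version adds the products in a different order than
the scalar loop, so results can differ in the last bits for general
floating-point input.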

> I think a "better" alternative would be to help the GCC folks do a
> better (?) job at writing more efficient tree-vectorization
> implementations and transformations that produce great SIMD-aware code
> built into the compiler. If somehow you can help with the pattern
> recognition of auto-vectorizable loops/algorithms from a higher level
> than the IR level and do source-level transformation of C++ (which I
> think would be way cool BTW, much like how the Lisp compilers are able
> to do so) to be able to produce the right (?) IR for the compiler to
> better auto-vectorize, then *that* would be something else.
>
Fact is, well, I'm more accustomed to doing software engineering than
compiler writing. I wish I could lend a hand, but that's far outside my
skill set.
But I'm open to learning new things. If you have an entry into the GCC
community, I'm all for it.

> Maybe you'd also like to look at GCC-ICI [0] to see how you can play
> around with extending GCC when it comes to implementing these
> (admittedly if I may say so) cool optimizations of the algorithms to
> become automatically vectorized.
>
Reference noted. :)
> I think I understand what you're trying to achieve by adding a layer
> of indirection at the library/code level -- however I personally think
> (and have read through numerous papers already, while doing my
> research about parallel computing earlier in my student days) that
> these optimizations/transformations are best served by the tools that
> create the machine code (i.e. compilers) rather than dealt with at the
> source code level. What I mean by this is that even though it's
> technically possible to implement a "parallel C++ DSEL" (which can be
> feasibly achieved now with Boost.Proto and taking some lessons with
> attribute grammars [1] and how Spirit 2(x?) with a combination of
> Boost.Phoenix does it) it would be reaching a bigger audience and
> serving a larger community if the compilers became smarter at doing
> the transformations themselves, rather than getting C++ developers to
> learn yet another library.
>
Well, except that this SIMD library is not aimed at end users. The vec
library I proposed is there for library builders, as a way to quickly
express SIMD code fragments across SIMD platforms. It is NOT meant to be
yet another array class with SIMD capability, because that model is too
restrictive for library developers who may need SIMD access to do
something else with a different model. And, well, it doesn't sound worse
than learning a new library like Boost.Asio or Boost.Thread.

I already have something far more high-level, from which this code is
extracted. NT2 is a Matlab-like scientific computing library [2] that
simply reinjects Matlab syntax into C++ via a couple of DSELs and takes
care of vectorization and thread creation on multi-core machines. This
was found to be more appealing, as users can just take their Matlab
code, copy and paste it into a cpp file, search/replace a few syntax
quirks and compile with NT2 to get an instant performance increase. I
can tell you that this appeals more to HPC users than any low-level API,
even with the sexiest encapsulation you can have.

As for source-level code transformation tools: they are plentiful and
successful. Now, are you able to name *one* that is actually used
outside academic research? I'm still convinced that high-level,
domain-oriented tools are the way to go, not just adding new layers of
things like MPI, OpenCL or whatnot. However, those layers, which tend to
multiply as architecture flavors change, need to be abstracted. And it
is only such an abstraction that I was proposing for SIMD intrinsics: a
simple, POD-like entity that maps onto them and takes care of generating
the most efficient code for a given platform. I don't see why it should
be different in scope from Boost.Thread, which does the same for
threading APIs.
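
To give an idea of what such a POD-like entity could look like, here is
a minimal sketch. The names pack4f, load4 and store4 are hypothetical
and chosen for this example only; this is not the interface of the
proposed library. It maps onto SSE when available and falls back to
plain floats otherwise, so client code stays the same on both.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical pack of 4 floats. With SSE it is exactly one register;
// without it, a plain array keeps the same interface.
struct pack4f
{
#if defined(__SSE__)
    __m128 v;
#else
    float v[4];
#endif
};

inline pack4f load4(const float* p)
{
    pack4f r;
#if defined(__SSE__)
    r.v = _mm_loadu_ps(p);
#else
    for (int i = 0; i < 4; ++i) r.v[i] = p[i];
#endif
    return r;
}

inline void store4(float* p, pack4f x)
{
#if defined(__SSE__)
    _mm_storeu_ps(p, x.v);
#else
    for (int i = 0; i < 4; ++i) p[i] = x.v[i];
#endif
}

inline pack4f operator+(pack4f a, pack4f b)
{
    pack4f r;
#if defined(__SSE__)
    r.v = _mm_add_ps(a.v, b.v);
#else
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
#endif
    return r;
}

inline pack4f operator*(pack4f a, pack4f b)
{
    pack4f r;
#if defined(__SSE__)
    r.v = _mm_mul_ps(a.v, b.v);
#else
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
#endif
    return r;
}
```

A library author would then write store4(r, load4(a) * load4(b) +
load4(c)) once, and another back end (AltiVec, for instance) would only
need new branches inside these few functions.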

On a larger scale, my stance on the problem is the following:
architectures get more and more complex and, following this trend,
low-level tools start to lose their appeal, yet they are mostly the only
thing available. What is needed is a high-level abstraction layer on top
of them. But no single abstraction can encompass *all* the needs of
parallel software developers, so there is a need for various models and
abstractions (ranging from arrays to agents to God knows what), and all
of them need an interoperable interface. This is what DSL construction
tools allow us to do: quickly and properly specify those precisely
scoped tools in the form of libraries. When proper compiler support
and/or tools become available, we'll just shift the library
implementation, much like Boost already supports C++0x constructs inside
its implementation.

Hoping that we aren't hijacking the list too much. But I'm curious to
know the position of other Boost members on how the parallelism problem
should be solved within our bounds.

References
[1] Quaff:
 -> principles : http://www.lri.fr/~falcou/pub/falcou-PARCO-2007.pdf
 -> the Cell version (French paper) :
http://www.lri.fr/~falcou/pub/falcou-SYMPA-2008.pdf
[2] NT2 : http://www.springerlink.com/content/l4r4462r25740127/

-- 
___________________________________________
Joel Falcou - Assistant Professor
PARALL Team - LRI - Universite Paris Sud XI
Tel : (+33)1 69 15 66 35
