Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-06-11 18:37:12
On 11/06/2011 17:42, David A. Greene wrote:
> Mathias Gaunard<mathias.gaunard_at_[hidden]> writes:
>> Register allocation.
> But that's not where the difficult work is.
Right, NP-complete problems are not difficult.
It's not really a problem when you're doing a small function in
isolation, but we want all the functions to be inlineable (and most of
them to be inlined), and we don't know in advance whether we need to
copy the operands or which registers will be used.
>>>> Currently we support all SSEx family, all AMD-specific stuff and
>>>> Altivec for PPC and Cell, and we have a protocol to extend that.
>>> How many different implementations of DGEMM do you have for x86? I have
>>> seen libraries with 10-20.
>> That's because they don't have generic programming, which would allow
>> them to generate all variants with a single generic core and some [...]
> No. No, no, no. These implementations are vastly different. It's not
> simply a matter of changing vector length.
>> We work with the LAPACK people, and some of them have realized that
>> the things we do with metaprogramming could be very interesting to
>> them, but we haven't had any research opportunity to start a project
>> on this yet.
> I'm not saying boost.simd is never useful. I'm saying the claims made
> about it seem overblown.
What I was saying about usage of meta-programming applied to the writing
of adaptive and fast linear algebra primitives is completely unrelated
to Boost.SIMD, although it could use it.
>>> - Write it using the operator overloads provided by boost.simd. Note
>>> that the programmer will have to take into account various
>>> combinations of matrix size and alignment, target microarchitecture
>>> and ISA and will probably have to code many different versions.
>> Shouldn't you just need the cache line size? This is something we
>> provide as well.
> Nope. It's a LOT more complicated than that.
Well, as far as I know, the only platform-specific stuff you do for
matrix multiplication apart from vectorization is loop tiling.
Can your magic compiler guarantee it will do a perfect job at this, with
a cache size only known at runtime?
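Loop tiling itself is straightforward to express; the hard, platform-specific part is choosing the block size. A minimal sketch (hypothetical names, with the block size passed in explicitly rather than derived from the cache size):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference: the naive triple loop over n-by-n row-major matrices.
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float s = 0.f;
            for (std::size_t k = 0; k < n; ++k)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

// Tiled version: BS is the block size, i.e. exactly the piece of
// platform-specific knowledge (cache size) discussed above.
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n, std::size_t BS)
{
    for (float& x : C) x = 0.f;
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // Work on one BS-by-BS block at a time so the operands
                // stay resident in cache across the inner loops.
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k)
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```

Both versions compute the same product; the point is that the restructuring is parameterized by a number the compiler would have to guess.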
>> C++ metaprogramming *is* an autotuning framework.
> To a degree. How do you do different loop restructurings using the [...]
I suggest you read some basic literature on what you can do with
templates, in particular Todd Veldhuizen's "Active Libraries: Rethinking
the roles of compilers and libraries".
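As a concrete illustration of the point (a sketch, not code from any library): a template parameter can select among loop restructurings at compile time, so one generic source generates every variant, e.g. the unroll factor of a reduction:

```cpp
#include <cstddef>

// The unroll factor is a compile-time parameter; each instantiation is
// a differently restructured loop, all generated from one generic source.
template <std::size_t Unroll>
float sum(const float* data, std::size_t n)
{
    float acc[Unroll] = {};            // Unroll independent accumulators
    std::size_t i = 0;
    for (; i + Unroll <= n; i += Unroll)
        for (std::size_t u = 0; u < Unroll; ++u)
            acc[u] += data[i + u];
    float total = 0.f;
    for (std::size_t u = 0; u < Unroll; ++u)
        total += acc[u];
    for (; i < n; ++i)                 // scalar remainder
        total += data[i];
    return total;
}
```

An autotuner can then benchmark `sum<1>`, `sum<2>`, `sum<4>`, ... and pick the winner, without ever writing more than one implementation.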
>> The smallest of things can prevent them from vectorizing. Sure, if
>> you add a few restrict there, a few pragmas elsewhere, some specific
>> compiling options tied to floating point, you might be able to get the
>> system to kick in.
> Yep. And that's a LOT easier than hand-restructuring loops and writing
> vector code manually.
I may not want to set the floating-point options for my whole program.
And if I had some pragmas just around the bit I want, that starts to
look like explicit vectorization.
Our goal with Boost.SIMD is not to write vector code manually (you don't
go down to the instruction level), but rather to make
vectorization explicit (you describe a set of operations on operands
whose types are vectors).
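To illustrate the idea (a portable sketch, not Boost.SIMD's actual implementation, which maps such a type onto real SIMD registers): operations are written on a pack-like vector type, so the vectorization is explicit in the types rather than in instructions:

```cpp
#include <cstddef>

// A minimal pack-like type: the operand types are vectors, and the
// overloaded operator expresses one operation on a whole vector.
// (Boost.SIMD's pack<T> would back this with an SSE/AltiVec register;
// here it is just an array, for portability of the sketch.)
template <typename T, std::size_t N>
struct pack
{
    T data[N];

    friend pack operator+(pack a, pack b)
    {
        pack r;
        for (std::size_t i = 0; i < N; ++i)
            r.data[i] = a.data[i] + b.data[i];
        return r;
    }
};
```

User code then reads like scalar arithmetic (`c = a + b`) while the width is carried by the type, which is exactly the "explicit but not instruction-level" middle ground described above.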
>> But my personal belief is that automatic parallelization of arbitrary
>> code is an approach doomed to failure.
> Then HPC has been failing for 30 years.
That's funny, because a huge portion of the HPC community seems to be busy
recoding stuff for multicore and GPU.
How come they have to rewrite it since we have automatic parallelization
solutions? Surely they can just input their old C code to their compiler
and get optimal SIMD+OpenMP+MPI+CUDA as output.
Their monolithic do-it-all state-of-the-art compiler provided by their
hardware vendor takes care of everything they should need, and is able
to predict all of the best solutions, right?
Actually, I suppose it works fairly well with old Fortran code.
Yet all of these people found reasons that made them want to use the
tools themselves directly.
And CUDA, which is arguably hugely popular these days, requires people
to write their algorithms in terms of the kernel abstraction. It
wouldn't be able to just work well with arbitrary C code. And even
then, I understand it's fairly complicated for the compiler to make
this work well, despite the imposed coding paradigms.
Automatic parallelization solutions are nice to have. Like all compiler
optimizations, it's a best effort thing. But when you really need the
power, you've got to go get it yourself.
At least that's my opinion.
>> Programming is about making things explicit using the right language
>> for the task.
> Programming is about programmer productivity.
Productivity implies a product.
What matters in a product is that it fulfills the requirements.
If my requirements are to use some hardware -- not necessarily limited
to a single architecture -- to the best of its abilities, I'm better
off describing my code in the way most fit for that hardware than
betting everything on the hope that the compiler's automatic code
restructuring will allow me to do that.
When compilers start to guarantee some optimizations, maybe it will change.
But the feedback I gathered from compiler people is that they could not
guarantee certain types of transformations would be consistently applied
regardless of data size; running the passes in different order could
yield better or worse results, same with running them multiple times etc.
Compilers just give "some" optimization; there is no formalism behind
them that can prove they will always reduce certain patterns. Getting a
fully-optimized program still requires doing the optimization explicitly.
We're at a state where we even have to force inlining in some cases,
because even with the inline specifier some compilers do not do it.
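For reference, forcing inlining means reaching for vendor extensions, since `inline` itself is only a hint; a typical wrapper looks like this (the macro name is ours):

```cpp
// "inline" is only a hint to the compiler; these vendor extensions
// actually force inlining where they are supported.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline   // fall back to the plain hint
#endif

FORCE_INLINE int add(int a, int b) { return a + b; }
```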
>>> Boost.simd could be useful to vendors providing vectorized versions of
>>> their libraries.
>> Not all fast libraries need to be provided by hardware vendors.
> No, not all. In most other cases, though, the compiler should do it.
Monolithic designs are bad.
Some people specialize in specific things, and they should be the
providers for that thing.
>>> I have seen too many cases where programmers wrote an "obviously better"
>>> vector implementation of a loop, only to have someone else rewrite it in
>>> scalar so the compiler could properly vectorize it.
>> Maybe if the compiler was really that good, it could still do the
>> optimization when vectors are involved?
> No, because information has been lost at that point.
How so? What information?
There is no assembly involved, the compiler still has full knowledge.
There is no reason why the compiler couldn't tell that
__m128 a, b, c;
c = _mm_add_ps(a, b); // probably calls __builtin_ia32_addps(a, b)
is the same as
float a, b, c;
c = a + b;
except it does it four floats at a time.
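In other words, the scalar semantics of the intrinsic are fully known. A portable model (hypothetical names) of exactly what the compiler still sees:

```cpp
#include <cstddef>

// Portable model of __m128: four packed floats.
struct m128 { float f[4]; };

// The scalar semantics of _mm_add_ps: exactly "c = a + b",
// four floats at a time. No information has been lost.
m128 add_ps(m128 a, m128 b)
{
    m128 c;
    for (std::size_t i = 0; i < 4; ++i)
        c.f[i] = a.f[i] + b.f[i];
    return c;
}
```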