Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: David A. Greene (greened_at_[hidden])
Date: 2011-06-10 20:08:40
Joel Falcou <joel.falcou_at_[hidden]> writes:
> On 10/06/11 18:05, David A. Greene wrote:
>> For writing new code, I contend that a good compiler, with a few
>> directives here and there, can accomplish the same result as this
>> library and with less programmer effort.
> I don't really think so, especially if you take your "portable" stuff
> into the equation.
I have seen it done.
>> A simple example:
>> void foo(float *a, float *b, int n)
>> {
>> for (int i = 0; i < n; ++i)
>> a[i] = b[i];
>> }
>> This is not obviously parallel but with some simple help the user can
>> get the compiler to vectorize it.
> Seriously, are you kidding me? This is a friggin for_all ...
> You can not get more embarrassingly parallel.
No, it's not obviously parallel. Consider aliasing.
>> Another less simple case:
> And this accumulate? They are like the most basic EP example you can get.
IF the user allows differences in answers. Sometimes users are very
picky about getting exactly the same answer.
>> This is much less obviously parallel, but good compilers can make it so
>> if the user allows slightly different answers, which they often do.
> Yeah and any brain dead developer can write the proper
> boost::accumulate( simd::range(v), 0. )
> to get it right.
What's right? I want bit-reproducibility with that IBM machine from 10
years ago.
Yes, in cases like that one would simply not use boost.simd and would
tell the compiler not to vectorize. I'm trying to point out questions
and problems that arise in production systems.
> So, who's the compiler's daddy here ?
>> Can you explain why not? Assembly code in and of itself is not bad but
>> it raises some maintainability questions. How many different
>> implementations of a particular ISA will the library support?
> Because it is C functions maybe :E
What's the difference between:

  ADDPD XMM0, XMM1

and:

  XMM0 = __builtin_ia32_addpd (XMM0, XMM1)

I would contend nothing, from a programming-effort perspective.
> Currently we support all the SSEx family, all AMD-specific stuff and
> Altivec for PPC and Cell, and we have a protocol to extend that.
How many different implementations of DGEMM do you have for x86? I have
seen libraries with 10-20.
Ok, that's a bit unfair. You are not trying to reproduce BLAS or
anything. But let's say someone wants to write DGEMM. He or she has a
couple of options:
- Write it in assembler. Note that the programmer will have to take
into account various combinations of matrix size and alignment, target
microarchitecture and ISA and will probably have to code many
different versions.
- Write it using the operator overloads provided by boost.simd. Note
that the programmer will have to take into account various
combinations of matrix size and alignment, target microarchitecture
and ISA and will probably have to code many different versions.
- Write just one version using either of the above. It will work
reasonably well in many cases and completely stink in others.
- Use an autotuning framework that generates many different variants by
exploiting the abilities of a vectorizing compiler.
I'm sure there are other options, but these are the most common
approaches. Everyone in the industry is moving to the last option.
>> 4 floats are available. That does not mean one always wants to use
>> all of them. Heck, it's often the case one wants to use none of them.
> Not using all elements in a SIMD vector is Doing It Wrong.
I have seen cases in real code where using all of the elements is
exactly the wrong thing to do. How would I express this in boost.simd?
What happens when I move that code to another implementation where using
all of the elements in a vector is exactly the right thing to do?
>> I'm demonstrating what I mean by "performance portable." Substitute
>> "GPU" with any CPU sufficiently different from the baseline.
> I wish you read about what a "library scope" and "rationale" mean.
> Are you complaining that Boost.MPI doesn't cover GPU too ?
MPI is a completely different focus and you know it. Your rationale, as
I understand it, is to make exploiting data parallelism simpler. That's
good! We need more of that. I am trying to explain that simply using
vector instructions is usually not enough. Vectorization is hard. Not
the mechanics, that's relatively easy. Getting the performance out is a
lot of work. That is where most of the effort in vectorizing compilers
goes.
Intel and PGI compilers are not better than gcc because they can
vectorize and gcc cannot. gcc can vectorize just fine. Intel and PGI
compilers are better than gcc because they understand how to restructure
code at a high level and they have been taught when (and when not!) to
vectorize and how to best use the vector hardware. That is something
not easily captured in a library.
>> Intel and PGI.
> Ok, what do the guys on machines supported by neither Intel nor PGI do?
> Cry blood ?
If boost.simd is targeted to users who have subpar compilers, that's
fine. But please don't go around telling people that compilers can't
vectorize and parallelize. That's simply not true.
Boost.simd could be useful to vendors providing vectorized versions of
their libraries. These are cases where for some reason or other the
compiler can't be convinced to generate the absolute best code. That
happens and I can see some portability benefits from Boost.simd there.
But it is not as easy as you make it out to be and I don't think
Boost.simd should be sold as a preferred, general way for everyday
programmers to exploit data parallelism.
I have seen too many cases where programmers wrote an "obviously better"
vector implementation of a loop, only to have someone else rewrite it in
scalar so the compiler could properly vectorize it.