Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: David A. Greene (greened_at_[hidden])
Date: 2011-06-10 19:05:12


Joel Falcou <joel.falcou_at_[hidden]> writes:

> On 10/06/11 17:09, David A. Greene wrote:

>> It's not a high level of abstraction. It's a very low level one. Users
>> are barely willing to restructure loops to enable vectorization. Many
>> will be unwilling to rewrite them completely. On the other hand, the
>> data show that they are quite willing to add directives here and there.
>
> If range are not higher level than for loop, I think we can stop
> discussing right here.

The for loops already exist. I'm primarily talking about users
vectorizing existing code.

For writing new code, I contend that a good compiler, with a few
directives here and there, can accomplish the same result as this
library and with less programmer effort.

>> On what code? It's quite easy to achieve that on something like a
>> DGEMM. DGEMM is also an embarrassingly vectorizable code.
>
> Give me one example of non-EP code which needs and can be vectorized.

Many codes are not embarrassingly vectorizable. Compilers have to
perform major loop restructuring and other transformations to expose the
parallelism. Often users have to do it for the compiler, just as they
would for this library. Users don't like to manually restructure their
code, but they are ok with putting directives in the code to tell the
compiler what to do.

A simple example:

void foo(float *a, float *b, int n)
{
   for (int i = 0; i < n; ++i)
      a[i] = b[i];
}

This is not obviously parallel, because the compiler cannot prove that
a and b do not overlap, but with some simple help the user can get the
compiler to vectorize it.
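
One way that help might look (my sketch of the idea, not anything from
the library): restrict-qualify the pointers, or put a vendor ivdep-style
directive on the loop.

/* Same loop with the no-overlap guarantee made explicit: restrict
   promises the compiler that a and b do not alias, which is the
   information it needs to vectorize the copy. */
void foo(float *restrict a, const float *restrict b, int n)
{
   for (int i = 0; i < n; ++i)
      a[i] = b[i];
}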

Another less simple case:

float foo(float *restrict a, int n)
{
   float result = 0.0f;

   for (int i = 0; i < n; ++i)
      result += a[i];

   return result;
}

This is much less obviously parallel: vectorizing the sum means keeping
partial sums in separate lanes and combining them at the end, which
reorders the floating-point additions and can change the result
slightly. Good compilers will do it anyway if the user allows slightly
different answers, which they often do.
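
Roughly the transformation a vectorizing compiler applies (my
hand-written 4-wide sketch, not actual compiler output):

/* Strip-mined version of the reduction above: four running partial
   sums, combined after the loop. The additions happen in a different
   order than in the scalar loop, hence the slightly different result. */
float foo_vec(float *restrict a, int n)
{
   float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
   int i = 0;

   for (; i + 4 <= n; i += 4) {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
   }

   float result = (s0 + s1) + (s2 + s3);

   for (; i < n; ++i)      /* scalar remainder */
      result += a[i];

   return result;
}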

>> That's effectively assembly code.
>
> No.

Can you explain why not? Assembly code in and of itself is not bad but
it raises some maintainability questions. How many different
implementations of a particular ISA will the library support?

>> No. On SSEx machines, a vector of 32-bit floats can have 1, 2, 3 or 4
>> elements.
>
> No, SSE2 __m128 contains 4 floats. Period.

4 floats are available. That does not mean one always wants to use all
of them. Heck, it's often the case one wants to use none of them.
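
To make that concrete (my illustration, not boost.simd code): even in a
loop that does vectorize, the tail iterations cannot use all four
lanes, and for a very small n it is often better to use none of them.

#include <xmmintrin.h>

/* 4-wide SSE copy: the main loop uses all four lanes, the scalar tail
   uses none, and a tiny n never reaches the vector loop at all. */
void copy4(float *restrict a, const float *restrict b, int n)
{
   int i = 0;
   for (; i + 4 <= n; i += 4)
      _mm_storeu_ps(&a[i], _mm_loadu_ps(&b[i]));
   for (; i < n; ++i)
      a[i] = b[i];
}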

>> Consider AVX. This is _not_ an easy problem to solve. It is not always
>> the right answer to vectorize using the fully available vector length.
>
> AVX has 256 bits register and fits 8 floats. Again, what did I miss ?

The fact that the implementation of 256-bit operations may really stink
on a particular microarchitecture, and the fact that AVX integer
operations are only 128 bits wide, so one has to make a difficult
tradeoff for loops that mix integer and floating-point computation.
This happens a lot.
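
The kind of loop I have in mind (illustration only): the float work
could run 8 wide in a ymm register, but the 32-bit integer work has no
256-bit form on AVX, so the compiler must either split it into two
128-bit halves or vectorize the whole loop at 4 wide.

/* Mixed integer/float body: the multiply could use 256-bit AVX, the
   integer add only 128-bit SSE, forcing a width tradeoff. */
void scale_and_count(float *x, int *hits, float s, int n)
{
   for (int i = 0; i < n; ++i) {
      x[i] *= s;
      hits[i] += (x[i] > 1.0f);
   }
}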

>> I know what a pack<> is. Perhaps I wasn't clear. If I have an
>> operation (say, negation) under where() in which the even condition
>> elements are true and the odd condition elements are false, what is the
>> produced result for the odd elements of the result vector?
>
> where is ?:. It requires three argument. I tempted to say RTFM.
>
> a = c ? b; is not valid code, so neither is where(c,a);

Ah, I misread the slide. My example would look like this in boost.simd:

a = c ? -b : a

So effectively, the result retains the original values. I was
previously thinking more along the lines of vector predication. I guess
in the context of this library, your semantics make sense.
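
In scalar terms, the semantics as I now understand them (my paraphrase,
not library code):

/* Element-wise view of  a = c ? -b : a :  lanes where the condition is
   false keep their previous value rather than being zeroed or left
   undefined, as they might be under hardware predication. */
void masked_negate(float *a, const float *b, const int *c, int n)
{
   for (int i = 0; i < n; ++i)
      a[i] = c[i] ? -b[i] : a[i];
}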

>> What happens if you move the code from Nehalem to Barcelona? How about
>> from an NVIDIA GPU to Nehalem?
>
> Where did I say this stuff targeted GPU.

I'm demonstrating what I mean by "performance portable." Substitute
"GPU" with any CPU sufficiently different from the baseline.

> You are again recycling the same non argument than in your last
> intervention on this very topic last year.

Sorry, my memory is not as good as it used to be. :) I'm not sure what
you're referring to.

>> Compilers have been doing this since the '70's. gcc is not an adequate
>> compiler in this respect, but it is slowly getting there.
>
> MSVC does not, neither xlC ... neither clang ... so which compilers
> takes random crap C code and vectorize it automagically ?

Intel and PGI.

                              -Dave

