
Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-06-10 21:33:37


On 11/06/2011 02:08, David A. Greene wrote:

> What's the difference between:
>
> ADDPD XMM0, XMM1
>
> and
>
> XMM0 = __builtin_ia32_addpd (XMM0, XMM1)
>
> I would contend nothing, from a programming effort perspective.

Register allocation.

>> Currently we support the whole SSEx family, all AMD-specific stuff and
>> Altivec for PPC and Cell, and we have a protocol to extend that.
>
> How many different implementations of DGEMM do you have for x86? I have
> seen libraries with 10-20.

That's because they don't have generic programming, which would allow
them to generate all variants with a single generic core and some
meta-programming.

We work with the LAPACK people, and some of them have realized that the
things we do with metaprogramming could be very interesting to them, but
we haven't had any research opportunity to start a project on this yet.

> Ok, that's a bit unfair. You are not trying to reproduce BLAS or
> anything. But let's say someone wants to write DGEMM. He or she has a
> couple of options:

We gave it a quick try; we were slower, so we didn't look into it much
further. We may attack it again some other day. For now we consider the
versions that exist elsewhere fast enough.

> - Write it using the operator overloads provided by boost.simd. Note
> that the programmer will have to take into account various
> combinations of matrix size and alignment, target microarchitecture
> and ISA and will probably have to code many different versions.

Shouldn't you just need the cache line size? This is something we
provide as well.

Ideally you shouldn't need anything else that cannot be made
architecture-agnostic.

And as I said, you should make the size properties (and even alignment,
if you really care) template parameters, so that you can dispatch to the
relevant implementation at compile time...

> - Use an autotuning framework that generates many different variants by
> exploiting the abilities of a vectorizing compiler.

C++ metaprogramming *is* an autotuning framework.

Except there is significantly less effort in writing a library than in
writing a compiler; and a library is not tied to a particular compiler,
which is a great advantage.

> I have seen cases in real code where using all of the elements is
> exactly the wrong thing to do. How would I express this in boost.simd?

Ignore the elements you don't care about?

> Your rationale, as
> I understand it, is to make exploiting data parallelism simpler.

No it isn't.
Its goal is to provide a SIMD abstraction layer. It's an infrastructure
library to build other libraries. It is still fairly low-level.

Making data parallelism simpler is the goal of NT2. And we do that by
removing loops and pointers entirely.

There is more to data parallelism than SIMD. It's just one of the
building blocks.

> Intel and PGI compilers are not better than gcc because they can
> vectorize and gcc cannot. gcc can vectorize just fine. Intel and PGI
> compilers are better than gcc because they understand how to restructure
> code at a high level and they have been taught when (and when not!) to
> vectorize and how to best use the vector hardware. That is something
> not easily captured in a library.

Again, we're not interested in automatic restructuring and
vectorization of arbitrary code.

It's an interface with which the programmer can explicitly structure his
code for vectorization.

I don't want to write random loops and have my compiler parallelize them
when it happens to find some that are parallelizable, I want a tool that
helps me write my code in the way I need to in order for it to generate
vectorized instructions.

>
>>> Intel and PGI.
>>
>> Ok, what do people on machines supported by neither Intel nor PGI do?
>> Cry blood?
>
> If boost.simd is targeted to users who have subpar compilers

Compilers other than Intel's or PGI's are subpar compilers? Maybe if
you live in a very secluded world.

They may be good at vectorization, but they're not even that good at C++.

> But please don't go around telling people that compilers can't
> vectorize and parallelize. That's simply not true.

Run the trivial accumulate test?
The smallest of things can prevent them from vectorizing. Sure, if you
add a few restricts here, a few pragmas there, and some specific
floating-point compilation options, you might be able to get the
system to kick in.

But my personal belief is that automatic parallelization of arbitrary
code is an approach doomed to failure.
Programming is about making things explicit using the right language for
the task. The language of classical loops in C with pointers is
ill-suited to describe operations that can be evaluated in parallel.

That approach doesn't seem to be nearly as successful as tools and
languages for explicit parallelization, be it for task-parallel or
data-parallel problems.

Fortran is a bit better, but not quite there. Matlab, as a language, is
more interesting.

> Boost.simd could be useful to vendors providing vectorized versions of
> their libraries.

Not all fast libraries need to be provided by hardware vendors.

> I have seen too many cases where programmers wrote an "obviously better"
> vector implementation of a loop, only to have someone else rewrite it in
> scalar so the compiler could properly vectorize it.

Maybe if the compiler was really that good, it could still do the
optimization when vectors are involved?


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk