Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: David A. Greene (greened_at_[hidden])
Date: 2011-06-14 17:34:53


Mathias Gaunard <mathias.gaunard_at_[hidden]> writes:

>>> Shouldn't you just need the cache line size? This is something we
>>> provide as well.
>>
>> Nope. It's a LOT more complicated than that.
>
> Well, as far as I know, the only platform-specific stuff you do for
> matrix multiplication apart from vectorization is loop tiling.

Vector length matters. Instruction selection matters. Prefetching
matters.
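
To make that concrete: even a plain tiled matrix multiply (rough sketch
below; TILE is an illustrative placeholder, not a tuned value) needs a
tile size chosen from cache capacity, vector length, and register
pressure together. The cache line size by itself tells you almost none
of that.

    #include <cstddef>

    // TILE is a hypothetical tuning knob; on a real machine it depends
    // on cache capacity, vector width, and register pressure, not just
    // the cache line size.
    enum { TILE = 64 };  // placeholder value, not a recommendation

    // C is assumed zero-initialized by the caller.
    void matmul_tiled(const float* A, const float* B, float* C,
                      std::size_t n)
    {
        for (std::size_t ii = 0; ii < n; ii += TILE)
            for (std::size_t kk = 0; kk < n; kk += TILE)
                for (std::size_t jj = 0; jj < n; jj += TILE)
                    // Block multiply; the innermost j loop (stride-1)
                    // is the one a vectorizer or boost.simd targets.
                    for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                        for (std::size_t k = kk; k < kk + TILE && k < n; ++k)
                            for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }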

> Can your magic compiler guarantee it will do a perfect job at this,
> with a cache size only known at runtime?

No one can ever guarantee "perfect." But the compiler should aim to
reduce the programming burden.

>> To a degree. How do you do different loop restructurings using the
>> library?
>
> I suggest you read some basic literature on what you can do with
> templates, in particular Todd Veldhuizen's "Active Libraries:
> Rethinking the roles of compilers and libraries"
>
> <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.8031&rep=rep1&type=pdf>

Thanks for the pointer! Printed out and ready to digest. :)

> And if I had some pragmas just around the bit I want, that starts to
> look like explicit vectorization.

But not like boost.simd, since with pragmas the actual algorithm code
doesn't get touched.
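
For example (sketch only; "#pragma vector always" is the Intel
compiler's spelling of the hint, and other compilers spell it
differently), the loop body itself stays plain scalar code:

    // The directive carries the vectorization request; the algorithm
    // below it is untouched and still compiles anywhere.
    void saxpy(float* y, const float* x, float a, int n)
    {
    #pragma vector always   // Intel-compiler hint; ignored elsewhere
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }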

> Our goal with Boost.SIMD is not to write vector code manually (you
> don't go down to the instruction level), but rather to allow making
> vectorization explicit (you describe a set of operations on operands
> whose types are vectors).

But again, that's not always the right thing to do. Often one only
wants to partially vectorize a loop, for example. I'm sure boost.simd
can represent that, too, but it is another very platform-dependent
choice.
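
A sketch of what I mean by partial vectorization, via loop distribution
(illustrative code, not boost.simd): the independent statement is split
off and vectorized while the recurrence stays scalar, and whether that
split actually pays off is exactly the kind of platform-dependent call
I'm talking about.

    // A single loop containing both statements would not vectorize,
    // because the second statement carries a dependence on s[i-1].
    // Distributing it lets the compiler vectorize half of it.
    void partial(float* a, const float* b, float* s, int n)
    {
        for (int i = 0; i < n; ++i)     // independent: vectorizable
            a[i] = 2.0f * b[i];

        for (int i = 1; i < n; ++i)     // loop-carried dependence: scalar
            s[i] = s[i - 1] + a[i];
    }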

>>> But my personal belief is that automatic parallelization of arbitrary
>>> code is an approach doomed to failure.
>>
>> Then HPC has been failing for 30 years.
>
> That's funny, because a huge portion of HPC people seems to be busy
> recoding stuff for multicore and GPU.

For the most part, they are FINALLY moving away from pure MPI code and
starting to use things like OpenMP. For the GPU, good compiler
solutions have only just begun to appear -- there's always a lag in tool
availability for new architectures. But GPUs aren't all that new
anyway. They are "just" vector machines.

But I don't know of many HPC users who want to yet again restructure
their loops. At best you can get them to restructure again IF you can
guarantee them that the restructured code will run well on any machine.
That's why performance portability is critical.

HPC users have had it with rewriting code.

> How come they have to rewrite it since we have automatic
> parallelization solutions? Surely they can just input their old C code
> to their compiler and get optimal SIMD+OpenMP+MPI+CUDA as output.

If the algorithm has already been written to vectorize, it should map to
the GPU just fine. If not, it will have to be restructured anyway,
which is generally beyond the capability of compilers or boost.simd,
though many times loop directives can just tell the compiler what to do.

> Their monolithic do-it-all state-of-the-art compiler provided by their
> hardware vendor takes care of everything they should need, and is able
> to predict all of the best solutions, right?

In many cases, yes. Of course, some level of user directive helps a
lot. It's very hard for the compiler to automatically decide which
kernels should run on a GPU, for example. But the user just needs a
directive here or there to specify that.
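
Something like this (a sketch; the spelling shown is OpenACC-style, in
the same spirit as the PGI directives, and I'm not quoting any
particular compiler's manual):

    // One directive marks the kernel for GPU offload; the loop itself
    // is ordinary code and still builds if the directive is ignored.
    void scale(float* y, const float* x, float a, int n)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }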

> And CUDA, which is arguably hugely popular these days, requires people
> to write their algorithms in terms of the kernel abstraction.

Yuck, yuck, yuck! :) CUDA was a fine technology when GPUs became
popular, but it is being replaced quickly by things like the PGI GPU
directives. The pattern is: do it manually first, then use compiler
directives (designed based on learning from the manual approach), then
have the compiler do it automatically (not possible in general, but
often very effective in certain cases).

> Automatic parallelization solutions are nice to have. Like all
> compiler optimizations, it's a best effort thing. But when you really
> need the power, you've got to go get it yourself. At least that's my
> opinion.

Sometimes, yes. But I would say with a good vectorizing compiler that
is extremely rare. Given an average compiler, I certainly see the
goodness in boost.simd.

> If my requirements are to use some hardware -- not necessarily limited
> to a single architecture -- to the best of their abilities, I'm better
> off describing my code in the way most fit for that hardware than
> betting everything on the fact that the automatic code restructuring of
> the compiler will allow me to do that.

I think it's a combination of both. Overspecification often hampers the
compiler's ability to generate good code. Things should be specified at
the "right" level given the available compiler capabilities. Of course,
the definition of "right" varies widely from implementation to
implementation.

>>> Maybe if the compiler was really that good, it could still do the
>>> optimization when vectors are involved?
>>
>> No, because information has been lost at that point.
>
> How so? What information?

Unrolling is a good example. A hand-unrolled and scheduled loop is very
difficult to "re-roll." That has all sorts of implications for vector
code generation.
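
A trivial sketch of the problem: once someone has unrolled a reduction
by hand with separate accumulators, re-rolling it requires the compiler
to prove the strands form a single reduction and then to reassociate
floating-point adds, which it generally may not do.

    // Hand-unrolled, hand-scheduled reduction (assumes n % 4 == 0).
    // Re-rolling this back to "for (i) s += x[i]" requires
    // reassociating the adds, which changes floating-point results,
    // so the compiler must be conservative. The original loop
    // structure is lost information.
    float sum4(const float* x, int n)
    {
        float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }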

                                 -Dave

