Subject: Re: [boost] How to structurate libraries ?
From: David A. Greene (greened_at_[hidden])
Date: 2009-01-19 14:20:51


On Saturday 17 January 2009 03:33, Joel Falcou wrote:
> David A. Greene wrote:
> > Ahem: http://www.cray.com
>
> Two points:
> 1/ Not everyone has access to a Cray-like machine. Parallelization tools
> for CotS machines are not to be neglected and, on this front, lots of
> things need to be done.

gcc does some of this already. It's getting better. But it's far behind
other compilers.

> 2/ vector supercomputer != SIMD-enabled processor even if the former may
> include the latter.

A Cray XT machine is made up of AMD Barcelona processors; it is a SIMD
machine. SIMD is nothing more than very short vectors. SSE lacks some
nice hardware features that let compilers vectorize more things, but
auto-vectorization is perfectly doable with SSE for the operations the
hardware does support.
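
To make that concrete, here is the kind of trivially vectorizable loop I
mean (a sketch; the flags named are gcc's, and other compilers have
equivalents):

    #include <cstddef>

    // Independent iterations, unit stride, no loop-carried dependence.
    // Given that the arrays don't overlap, an SSE-capable vectorizer
    // turns this into packed mulps/addps, four floats per instruction.
    // Recent gcc releases enable the tree vectorizer at -O3 (-msse2 for
    // the target); without a restrict qualifier the compiler typically
    // versions the loop behind a runtime overlap check.
    void saxpy(std::size_t n, float a, const float* x, float* y)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }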

> > Auto-parallelization has been around since at least the '80s in
> > production machines. I'm sure it was around even earlier than that.
>
> What do you call auto-parallelization?
>
> Are you telling me that, nowadays, I can take *any* source code written
> in C or C++ or whatever, compile it with some compiler specifying
> --parallel, and automagically get a parallel version of the code? If
> so, you'll

Yes, you'll get regions of parallel code. Compilers can look at loops and
schedule iterations across threads or cores. In some cases they can schedule
independent calls on different threads. The compiler won't do magical stuff
though.
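
As a sketch of what that looks like in practice (the flags named are
icc's and gcc's auto-parallelization options; nothing else below is
special):

    #include <cstddef>

    // Every iteration is independent, so an auto-parallelizing compiler
    // (e.g. icc with -parallel, or gcc with -ftree-parallelize-loops=N)
    // can split the iteration space across threads with no source
    // changes.  A loop-carried dependence, say out[i] += out[i-1],
    // would defeat the analysis.
    void scale(std::size_t n, double k, const double* in, double* out)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = k * in[i];
    }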

> have to send a memo to at least a dozen research teams (including mine)
> all over the world so they can stop working on this problem and move on
> to something else. Should I also assume that each time a new
> architecture comes out, those compilers already know the best way to
> generate code for it? I beg to differ, but automatic parallelization
> is far from "done".

Did I say it was done? A compiler is not going to be able to take complex
pointer-chasing code and parallelize it automatically. With some of the new
parallel languages it has a better chance but in most cases the information
just isn't there. But your library didn't strike me as supporting general
parallelization along the lines of futures or such things. Structuring
high-level parallelism has almost[1] nothing to do with the minute machine
details you've talked about so far.

You're taking this way too personally. I'm not suggesting that researchers
close up shop. I'm suggesting that researchers should make sure they're
tackling the correct problems. My experience tells me the problem is not
the use of vector instructions[2] but how to express high-level,
task-based parallelism.
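
For concreteness, this is the style of expression I mean. The sketch
uses std::async/std::future for illustration (which postdate this
thread; the futures work under way for Boost is headed the same way):

    #include <future>
    #include <numeric>
    #include <vector>

    // High-level task parallelism: the two halves of the reduction are
    // independent tasks stated directly in the source, rather than
    // recovered by the compiler from loop analysis.
    double parallel_sum(const std::vector<double>& v)
    {
        auto mid = v.begin() + v.size() / 2;
        std::future<double> lo = std::async(std::launch::async,
            [&] { return std::accumulate(v.begin(), mid, 0.0); });
        double hi = std::accumulate(mid, v.end(), 0.0);
        return lo.get() + hi;
    }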

[1] Note that I said "almost." Some hardware features can greatly aid
parallelization, such as the semaphore bits on the MTA. In these cases I would
like to see architecture-specific optimizations within other libraries such
as Boost.Futures and Boost.MPI.

[2] Except for special-purpose fields like graphics where a general-purpose
compiler probably hasn't been given the smarts because it's not worth it to
the general-purpose compiler vendor.

> Then again, just looking at the problem of writing SIMD code: explain
> why we still get better performance on simple code when writing SIMD
> code by hand than when letting gcc auto-vectorize it?

Because gcc is not an optimizing compiler. That's not its focus. It's
getting better but I would encourage you to explore compilers from Intel,
PGI, and PathScale. All of these easily beat gcc in terms of performance.

Probably the biggest mistake academic researchers make is comparing their results
to gcc. It's just not a valid comparison.

> > Perhaps your SIMD library could invent convenient ways
> > to express those idioms in a machine-independent way.
>
> Well, considering the question was first about how to structure the
> group of libraries I'm proposing, I apologize for not having taken the
> time to express all the features of those libraries.

Well, one would expect the author of a library to enumerate what it can do.
If you have these kinds of constructs, that's great!

> Moreover, even with a simple example, the fact that the library hides
> the differences between SSE2, SSSE3, SSE3, SSE4, AltiVec, SPU-VMX, and
> the forthcoming AVX is a feature

Only insofar as it handles idioms the compiler won't otherwise recognize.

> on its own. Oh, and as specified in the previous mail, the DSL takes
> care of optimizing fused operations, so things like FMA are detected
> and replaced by the proper intrinsic when possible. Same with
> reductions like min/max, operations like b*c-a, or SAD on SSEx.

A good compiler will do that too. I don't know that current compilers will
make use of PSADBW because the computation may not be general-purpose enough.
But it shouldn't be hard to teach a compiler about the idiom. It's simply a
specific kind of reduction.
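
For reference, the scalar idiom behind PSADBW looks like this; teaching
a compiler to match it and emit the instruction is the same kind of
pattern recognition it already does for min/max reductions (a sketch,
not any particular compiler's rule):

    #include <cstddef>

    // Sum of absolute differences over unsigned bytes: the reduction
    // PSADBW computes across packed bytes, sixteen at a time with SSE2.
    unsigned sad(const unsigned char* a, const unsigned char* b,
                 std::size_t n)
    {
        unsigned sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        return sum;
    }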

> > Your simple SIMD expression example isn't terribly compelling. Any
> > competent compiler should be able to vectorize a scalar loop that
> > implements it
>
> Well, sorry then to have given a simple example.

Your claim was that compilers would not be able to handle it. I countered the
claim. I'm interested in any other examples you have.

> > What would be compelling is a library to express things like the Cell's
> > scratchpad. Libraries to do data staging would be interesting because
> > more and more processors are going to add these kinds of local memory
>
> I don't see what you have in mind. Do you mean something like
> Hierarchically Tiled Arrays (HTA)? Or some Cell-based development
> library? If the latter, I don't think Boost is the best home for it.
> As for HTA, lots of implementations already exist, and guess what, they
> just do the parallelization themselves instead of letting the computer
> do it.

I can't find a reference for HTA specifically, but from the name I'd
guess it's probably already covered elsewhere, as you say.
Still, it might be worthwhile to propose a Boost version. Other MPI
implementations exist, for example, but that didn't stop Boost.MPI.

The closest reference I found was
http://portal.acm.org/citation.cfm?id=645605.662909

Is that what you mean?

A Cell-based development library wouldn't be terribly useful. Something that
provided the same idioms across architectures would be. As you say, the
architecture abstraction is important.
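
To be clear about what I mean, here is a purely hypothetical sketch;
every name in it (local_buffer, stage_in, stage_out) is invented for
illustration and is not an existing Boost or vendor API:

    #include <cstddef>

    // Architecture-neutral staging idiom: on a Cell SPU, stage_in and
    // stage_out would wrap DMA transfers to and from local store; on a
    // cache-based CPU they could degrade to plain copies or prefetches.
    template <typename T, std::size_t N>
    class local_buffer {
        T storage_[N];
    public:
        void stage_in(const T* src, std::size_t count) {
            for (std::size_t i = 0; i < count && i < N; ++i)
                storage_[i] = src[i];   // DMA get on Cell
        }
        void stage_out(T* dst, std::size_t count) const {
            for (std::size_t i = 0; i < count && i < N; ++i)
                dst[i] = storage_[i];   // DMA put on Cell
        }
        T& operator[](std::size_t i) { return storage_[i]; }
    };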

> Anyway, we'll be able to discuss the library itself and its features
> when a proper thread for it starts.

I look forward to it. I think there's value here but we should figure out the
focus and drop any unnecessary things.

                                       -Dave

