Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: David A. Greene (greened_at_[hidden])
Date: 2011-06-14 17:49:01


Mathias Gaunard <mathias.gaunard_at_[hidden]> writes:

> We're currently setting up a test farm, and we'll try to graph the run
> time in cycles of all of our functions on different architectures.
> Any recommendation on which micro-architectures to include for x86?
> We can't afford to have too many.
> We mostly work with Core and Nehalem.

Try to get some AMD processors in there. Specifically, something
Barcelona or later.

When the AVX chips are out, it would be interesting to compare them. I
suspect the offerings from Intel and AMD will be vastly different given
the described Bulldozer architecture.

>> With AMD and Intel's latest offerings alignment is much less of a
>> performance issue.
>
> What about portability?

Sure, if binary portability is required the user will have to stick to
the largest subset of features common to all target machines.
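
To make the alignment trade-off concrete, here is a minimal sketch with
raw SSE intrinsics (not boost.simd code; the function names are mine).
The aligned form is the traditional fast path but faults on a misaligned
pointer; the unaligned form works for any address, and on Nehalem- and
Barcelona-class cores and later the penalty for it is small:

  #include <immintrin.h>
  #include <cstddef>

  // Scale n floats by k, four at a time (remainder handling omitted).
  // Assumes 'data' is 16-byte aligned, e.g. via alignas(16) or _mm_malloc.
  void scale_aligned(float* data, std::size_t n, float k)
  {
      const __m128 vk = _mm_set1_ps(k);
      for (std::size_t i = 0; i + 4 <= n; i += 4)
          _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), vk));
  }

  // Same loop with unaligned loads/stores: works for any pointer.
  void scale_unaligned(float* data, std::size_t n, float k)
  {
      const __m128 vk = _mm_set1_ps(k);
      for (std::size_t i = 0; i + 4 <= n; i += 4)
          _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), vk));
  }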

>> What's the vector length of a pack<int> on Bulldozer?
>
> 256 bits because __m256i exists, and there are some instructions for
> those types, even if they are few.

It's going to perform really badly on any AVX machine because you end up
having to generate tons of shuffles.

> I need to benchmark this, but I think the conversions from/to AVX/SSE
> are sufficiently fast to make it a good choice in general.

They really are not. You get about 3-4x the instructions in integer
code, for example. These sorts of things get determined with experience,
so I wouldn't expect boost.simd or anyone else to get it perfectly right
out of the gate. x86 in particular is very tricky to optimize. :)
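
For concreteness, here is roughly what a 256-bit integer add has to turn
into on an AVX-only (pre-AVX2) machine, sketched with raw intrinsics
(illustrative names, not boost.simd's interface). One logical add on a
256-bit pack<int> becomes a couple of extracts, two 128-bit adds, and an
insert, whereas an SSE-sized pack stays a single instruction:

  #include <immintrin.h>

  // AVX has no 256-bit integer arithmetic, so the operation is done
  // on the two SSE halves and the result is reassembled.
  __m256i add_epi32_avx1(__m256i a, __m256i b)
  {
      __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                                 _mm256_castsi256_si128(b));
      __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                                 _mm256_extractf128_si256(b, 1));
      return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
  }

  // With a 128-bit pack the same add is one SSE2 instruction.
  __m128i add_epi32_sse(__m128i a, __m128i b)
  {
      return _mm_add_epi32(a, b);
  }

Multiply that overhead across every integer operation in a kernel and
the instruction count blows up quickly.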

> That's not what I remember of my last discussions with people working
> with the polyhedral optimization model. They told me they came close,
> but still weren't as fast as state-of-the-art BLAS implementations.

Sure, there are cases where hand-written code performs best. In those
cases, boost.simd seems like a great solution.

>> My case comes from years of experience.

> You certainly have experience in writing compilers, but do you have
> experience in writing SIMD versions of algorithms within applications?

Some, yes. I'm by no means a restructuring expert. But I have looked
at an awful lot of HPC code. I've managed to uncover tons of compiler
bugs/limitations. :)

> You don't seem to be familiar with SIMD-style branching or other
> popular techniques.

I'm not sure what you mean by "SIMD-style branching." I think you mean
mask/merge/select operations.
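
For clarity, a minimal mask/merge/select example with raw SSE intrinsics
(not boost.simd's API; the function name is mine). Both sides of the
"branch" are evaluated for every lane, and a comparison mask picks the
result per lane, so there is no actual branch:

  #include <immintrin.h>

  // Scalar equivalent: r[i] = (x[i] >= 0.0f) ? x[i] : -x[i];
  __m128 abs_select(__m128 x)
  {
      __m128 zero = _mm_setzero_ps();
      __m128 mask = _mm_cmpge_ps(x, zero);   // all-ones where x >= 0
      __m128 neg  = _mm_sub_ps(zero, x);     // -x, computed for every lane
      // select: x where the mask is set, neg elsewhere
      return _mm_or_ps(_mm_and_ps(mask, x), _mm_andnot_ps(mask, neg));
  }

On SSE4.1 and later the and/andnot/or idiom collapses to a single
_mm_blendv_ps.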

> All tools are complementary and can thus co-exist. What I didn't like
> about your original post is that you said compilers were the one and
> only solution to parallelization.

I'm sorry if I conveyed that message. I wasn't trying to argue that but
I can understand how it might seem that way. I was reacting to the set
of slides that seemed to argue that the compiler never gets it right.
:)

> At least we can agree on something now ;).

I suspect we can agree on a great deal. HPC is hard, for example. ;)

                             -Dave

