Subject: Re: [boost] [gsoc] boost.simd news from the front.
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-06-11 20:53:51
On 11/06/2011 17:30, David A. Greene wrote:
>> The benchmarks show this. Just in the slides, there is a 4x
>> improvement when using std::accumulate (which is still a pretty
>> trivial algorithm) with pack<float> instead of float, and in both
>> cases automatic vectorization was enabled.
> With which compilers?
Mainstream compilers with mainstream options: what people use.
The BoostCon talk was deliberately simple; it was not aimed at HPC experts,
but rather tried to show that the library could be useful to everyone.
> But how does the programmer know what is best? It changes from
> implementation to implementation.
The programmer is not supposed to know, the library does.
The programmer writes a * b + c; we generate an fma instruction if one is
available, and a multiplication followed by an addition otherwise.
Now it is true there are some cases where there are multiple choices,
and which one is fastest is not clear and may depend on the
micro-architecture and not just the instruction set.
We don't really do different things depending on micro-architecture, but
I think the cases where it really matters should be rather few.
When testing on some other micro-architectures, we notice that there are
a few unexpected run times, but nothing that really justifies doing
different codegen, at least at our level.
We're currently setting up a test farm, and we'll try to graph the run
time in cycles of all of our functions on different architectures.
Any recommendations on which micro-architectures to include for x86? We
can't afford too many.
We mostly work with Core and Nehalem.
> That's overkill. Alignment often isn't required to vectorize.
It doesn't cost the user much to enforce it for new applications, and it
allows us not to have to worry about it.
> With AMD and Intel's latest offerings alignment is much less of a
> performance issue.
What about portability?
> 32 is not always the right answer on SandyBridge/Bulldozer because
> 256 bit vectorization is not always the right answer.
If 256-bit vectorization is not what you want, then you have to specify
the size you want explicitly.
Otherwise we always prefer the size that allows the most parallelism.
> What's the vector length of a pack<int> on Bulldozer?
256 bits, because __m256i exists and there are some instructions for
those types, even if there are only a few of them.
But I suppose 128 bits could also be an acceptable choice.
I need to benchmark this, but I think the conversions from/to AVX/SSE
are sufficiently fast to make it a good choice in general.
It's a default anyway, you can set the size you want.
> That is in fact what is happening via autotuning. Yes, in some cases
> hand-tuned code outperforms the compiler. But in the vast majority of
> cases, the compiler wins.
That's not what I remember from my last discussions with people working
on the polyhedral optimization model.
They told me they came close, but still weren't as fast as
state-of-the-art BLAS implementations.
> My case comes from years of experience.
You certainly have experience in writing compilers, but do you have
experience in writing SIMD versions of algorithms within applications?
You don't seem to be familiar with SIMD-style branching or other popular
techniques for writing SIMD code by hand.
> It may be useful in some instances, but it is not a general solution to
> the problem of writing vector code. In the majority of cases, the
> programmer will want the compiler to do it because it does a very good
> job.
All tools are complementary and can thus co-exist. What I didn't like
about your original post is that you said compilers were the one and
only solution to parallelization.
At least we can agree on something now ;).