
Subject: [boost] SIMD EDSL blues (was [XInt] Some after thoughts)
From: Joel Falcou (joel.falcou_at_[hidden])
Date: 2011-03-12 02:15:11


On 11/03/11 22:57, Simonson, Lucanus J wrote:
> The problems come from load/store, which, if you emit too many, the compiler generally doesn't optimize away. You need to somehow manage which variables are in registers and which are in memory, and if you get that wrong it is hard to right that wrong. For one thing, alias analysis won't be successful enough to do what the programmer could. For another, when it comes to intrinsics the compiler puts some trust in the programmer's choices and doesn't always try to second-guess. Loading and storing memory can get complicated in a great hurry, and you generally want to rearrange your algorithm and data structures for more efficient memory access. Just aligning things is a very big deal, and it isn't always enough to say "oh, the allocator will take care of that for me." There are a lot of complications once you get into the details of it. The compiler can sometimes do amazing things, but other times produces a really weak effort, and it will never replace your algorithm with a
> better one based on knowledge of the instruction set. There might be good explanations for why the compiler can't emit the code you expected, but that doesn't change the fact that it often disappoints.

Alignment is one issue; the only thing it solves is that you can be
sure you can mova* instead of movu* without thinking about it. Data
rearrangement is a whole different level of stuff, and we solved it
by *not* having to care at the SIMD DSL level: the upper array-like
interface takes care of this using some Fusion adaptation and
transformation mumbo-jumbo. It is very crucial that you separate the
concerns there. The SIMD DSL should only help you lay down your SIMD
data-flow algorithm, help you map it to whatever IS you may have in a
platform-independent way, provide you tools for iterating SIMD chunk by
SIMD chunk over arbitrary memory, and have the significant pattern
matcher in place for your ISA. Now, all the other algorithmic-level
transforms and other memory layout and access problems have to be solved
at an upper level (in whatever abstraction you need: matrix, array, big int,
image, polynomials, etc.) because the very core optimizations you may want
are expressed in a proper way in the high-level domain, not the low-level
one. The same goes for unrolling or strip-mining loop nests; it is not
the job of the SIMD DSL. It is something that has uses in some domains
but not others. The SIMD DSL should provide an interface that makes writing
such numerical code the same whether it is indeed in SIMD or scalar mode.

I really think this is the problem you encountered: by trying to make your
SIMD DSL do too much at its level, something that ultimately fails. The
only things such a beast should do are hide the intricacies of setting up
loads/stores from memory into SIMD registers, uniformize functions and
operators across ISAs, and provide SIMD-based iterator and range traversal
support. It should not do much more, but it should provide, mainly in its
pattern-matching facility or in its way to define a function, ways to be
extended for any new subtle low-level combination of operations leading
to an optimized SIMD block or IS support. See below for the rest of what
can be done.

> To implement optimal infinite-precision integer arithmetic with LRB vector instructions (or AltiVec) you have tons of goodies to help you in the instruction set, like multiply-and-store-high-result, the carry add / borrow sub, and gather/scatter, but if you abstract the instruction set away you don't even get the basics right, and all the goodies are out of reach entirely, because you need to craft your algorithm to fit the features of the instruction set to make use of them.

At some point, it all depends on how you did your homework. Having a
one-to-one mapping between one SIMD extension instruction and one
operator only works so far for classical operations. However, and we
agree on this, if IS 1 has a funky
'add-two-vectors-and-take-the-sqrt-of-them' fused operation and the
other does not, it is rather trivial to have some function like sqrt_sum(a,b)
and have it use the intrinsic in ISA 1 and do it "by the book" in the IS 2
support. In the same way, your pattern matcher will depend on the IS
and do the proper contraction if needed. Our stance on this matter was
not to say "abstract to the lowest common denominator of features" but to
try to see all SIMD ISs as an aggregate pool of features and fill the gaps
of ISs with fewer features than others. Our approach was to segregate
functions into "toolboxes" with different aims, providing mid-level
functionality that gets locally optimized with whatever IS support is there.
Now, whatever the IS, you can call it, and depending on what you have it
generates the proper algorithm. If I were about to rewrite some large
integer function or some crypto algorithm, what I would do is look at the
proper SIMD way to do it, seeking the most fused possible intrinsic for the
job (for crypto, IIRC SSE4 has some specific stuff, for example, so let's use
it in the SSE4 version). And after that, lo and behold, generic
programming strikes again: we will probably take some time to
lift/abstract/generalize said algorithm into a properly conceptualized series
of operations. Said operations will then be mapped onto new functions in
the SIMD DSL (and probably in a sequential fashion too). Rewrite the
algorithm operating on a parametric type T, use said new functions, and
watch the code generated be either SIMD or scalar depending on the ISs
selected.

In terms of the integer stuff at hand, this means the high-level algorithm
should be expressed as a combination of lifted concepts where the actual
bit twiddling can be aggregated into logical, semantic-rich functions
operating on whatever. Let the SIMD DSL provide such functions, implement
them as properly needed (use the proper specialized intrinsic or add a
pattern replacer), and voila.

No need to dumb down the feature set of the SIMD DSL; it will grow
naturally as concepts requiring it and some special stuff arise from
algorithmic analysis, much like normal code. And in this case,
you'll see that these problems of optimality actually lessen, as they
will be captured at a high level by your concept-based analysis.

I bet we have a lot of stuff to exchange and argue about on the subject, as
we seem to have taken radically different angles on the matter. I will be
glad if we can go over our first premises and maybe try to
see if our respective experiences in some concrete areas can't be shared
for the greater good of everyone. Especially if you have an actual
algorithm you see as something hard that didn't fit your design (and by
concrete I mean a snapshot of code, not a mere description in words, where
both of us can misinterpret parts; see the CSE fiasco above :p), so we
can look at the problem in our framework and see where problems arise.

Hoping to help :)


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk