Boost logo

Boost :

Subject: Re: [boost] [XInt] Some after thoughts about SIMD
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2011-03-11 16:57:26


Joel Falcou wrote:
> On 11/03/11 21:04, Simonson, Lucanus J wrote:
>> If you have the same variable in more than one place in an
>> expression you can map that expression to different/fewer
>> instructions than if they were all different variables in different
>> places. There is no way to perform that mapping using expression
>> templates because it cannot be detected in template metaprogramming
>> which varaibles are the same, only that they have the same type.
>> Therefor you cannot map expression to the optimal instructions using
>> template metaprogramming.
>>
>> In general three register instructions that modify one register and
>> perhaps use it as an input also will map to expression where the
>> variable is on both sides of the assignment operator. Think fused
>> multiply add.
>>
>> At first we thought we could do anything with expression templates.
>> It is not true. We can only perform logic on types at compile time.
>> You can't implement a compiler for runtime variables as a template
>> metaprogram, only a compiler for types. I already gave the example
>> of common sub expression elimination being unimplementable in
>> expression templates. Rather than a contrived example it is
>> actually the most general and easily understood example of the
>> problem. It is a pretty important optimization in and of itself,
>> but it is also widely understood, at least by readers of this list.
> Well, I dunno which compiler you use, but usually, when you do stuff
> properly, the ET are usually optimized by the compiler itself
> afterward
> and it usually do this for you.
> We made the test in nt2 with a shifted load based FIR using a very
> simple TMP logic, and well ... it never used more register than what
> we though tit should.
>
> As for madd, well ... we do pattern matching on the AST structure,
> reemit a madd node and let the thingy do its job afterward.
> I fail to see where the fail is, but it is important to have the ET
> emit rather trivial code and let the compiler polish the code
> generation afterward.
>
> Under GCC, the CSE stuff is done byt he compiler as if it was normal
> code.
> We noticed loop unrolling done through matrix ET and other such
> effect.
> Which compiler did you used ? How are your ET base code done ?

The problems come from load/store, which if you emit too many the compiler generally doesn't optimize them away. You need to somehow manage which varaibles are in register and which are in memory and if you get that wrong it is hard to right that wrong. For one thing alias analysis won't be successful enough to do what the programmer could. For another, when it comes to intrinsics the compiler puts some trust in the programmer's choices and doesn't always try to second guess. Loading and storing memory can get complicated in a great hurry and you generally want to rearrange your algorithm and data structures for more efficient memory access. Just aligning things is a very big deal and it isn't always enough to say "oh, the allocator will take care of that for me." There are a lot of complications once you get into the details of it. The compiler can sometimes do amazing things, but other times produce a really weak effort and it will never replace your algorithm with a better one based on knowledge of the instruction set. There might be good explanations for why the compiler can't emit the code you expected, but that doesn't change the fact that it often dissapoints.

I know you may object that first I say that the problem is having the same variable in multiple places to map to instructions like madd and now I'm saying the problem is load and store. Trust me, these problems are very much interrelated.

To implement optimal infinite prcision integer arithmetic with lrb vector instructions (or altavec) you have tons of goodies to help you in the instruction set like multiply and store high result and the carry add borrow sub and gather scatter, but if you abstract the instruction set away you don't even get the basics right and all the goodies are out of reach entirely because you need to craft your algorithm to fit the features of the instruction set to make use of them.

Regards
Luke


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk