
Subject: Re: [boost] [XInt] Some after thoughts
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2011-03-10 19:06:37


Jeremy Maitin-Shepard wrote:
> On 03/10/2011 01:06 AM, falcou wrote:
>
> > [snip]
>
>> 4/ Have a way to say large integer use SIMD register of 128 bits and
>> implement a SIMD aware large integer
>>
>
> This last item seems particularly useful, but I don't see how it has
> anything to do with a potential restructuring. Since non-bitwise
> operations aren't explicitly single instruction multiple data
> operations, really what we are talking about is optimizing the library
> so as to make the most use of the processor capabilities, which may
> include SSE instructions, etc. (Presumably the performance of GMP
> already reflects such optimizations.)
>
> Maybe what you have in mind is letting the digit type be a template
> parameter, and then substituting in a user-defined type (or in the case
> of some compilers, a compiler-defined non-standard type) that serves as
> a 128-bit unsigned integer. I'm not convinced that this level of
> abstraction is compatible with generation of optimal code, though.
> Furthermore, this abstraction doesn't seem particularly useful, as the
> only purpose I can imagine of specifying a non-default digit type would
> be for this particular optimization.
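
For concreteness, the kind of digit-type parameterization described above might look roughly like the sketch below. The names are purely illustrative and not XInt's actual interface; the 128-bit case relies on a compiler extension.

    // Illustrative only -- not XInt's interface. A big integer whose limb
    // ("digit") type is a template parameter, so a wider type can be swapped in.
    #include <cstdint>
    #include <vector>

    template <typename Digit = std::uint32_t>
    class basic_integer {
        std::vector<Digit> limbs_;   // least-significant limb first
        // all arithmetic would have to be written generically in terms of Digit
    };

    #if defined(__SIZEOF_INT128__)
    // GCC/Clang expose a non-standard 128-bit integer that could serve as a digit.
    typedef basic_integer<unsigned __int128> wide_integer;
    #endif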

I can attest that it is generally not compatible with generation of optimal code. Others may disagree, but I've had some unique experience trying to create a C++ abstraction on vector instruction sets. Aside from the fact that I couldn't do some necessary AST transformations in the ETs of our DSEL for vector operations, because there is no way to identify whether two arguments to an expression are the same variable rather than merely the same type, there were many, many other problems that caused the reality to fall short of our expectations and of promising early results.

Frankly, if you want fast SSE-accelerated code you should write it by hand in assembler. For infinite-precision integers in particular you need access to the carry bit generated by vector arithmetic, and to other instruction-level features; just replacing int with int_128_t isn't going to get you very far. You need to hand-unroll loops and generally put a lot of thought into what you are doing at the instruction level to get the best implementation, and everything needs to be driven by benchmark results. Even intrinsics don't give you the amount of control needed for maximum performance, because the compiler can't run benchmarks to guide its choices the way you can. (Well, the Intel compiler actually can, to some extent...)

Instead of writing assembler ourselves, we should provide points of customization that allow the user to override all of the algorithms with their own implementation. That could be GMP, their own hand-coded assembly, or high-level code optimized through Intel Parallel Building Blocks, for example, which accomplishes with vector type abstraction in C++ what ETs cannot, by means of a JIT compilation step.
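
As a rough sketch (not XInt's actual interface; all names below are hypothetical), a customization point for something like limb-wise addition could be as simple as routing the kernel through a policy that the user can replace with GMP's mpn routines, hand-written assembly, or an SSE version:

    #include <cstddef>
    #include <cstdint>

    typedef std::uint32_t limb_t;

    // Portable default kernel: schoolbook addition with explicit carry
    // propagation; returns the carry out of the most significant limb.
    struct default_kernels {
        static limb_t add_n(limb_t* out, const limb_t* a,
                            const limb_t* b, std::size_t n)
        {
            std::uint64_t carry = 0;
            for (std::size_t i = 0; i != n; ++i) {
                std::uint64_t sum = std::uint64_t(a[i]) + b[i] + carry;
                out[i] = static_cast<limb_t>(sum);
                carry  = sum >> 32;      // the carry bit the next limb needs
            }
            return static_cast<limb_t>(carry);
        }
    };

    // The library calls its kernels through a policy parameter; a user who
    // wants GMP, hand-coded assembly, or a vectorized routine supplies a
    // different policy with the same static member functions.
    template <typename Kernels = default_kernels>
    limb_t add_magnitudes(limb_t* out, const limb_t* a,
                          const limb_t* b, std::size_t n)
    {
        return Kernels::add_n(out, a, b, n);
    }

The same pattern would apply to multiplication, division and so on; since the dispatch happens at compile time, the user's hand-tuned kernel is called directly.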

Regards,
Luke

