Subject: Re: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2010-10-12 14:41:05
joel falcou wrote:
>> Could you share the results?
> Tell me which stuff you JIT with ;)
I've been talking about the Ct/RapidMind stuff, though my discussion has been about the general idea rather than the specifics of what they may have implemented. I don't know exactly what is in there and haven't tried it myself yet.
> More on that, I'm eager to know how dynamically generating code (in a
> string? in a file? as some bytecode array?) THEN
> running a compiler then executing the resulting binary can beat
> statically compiled code dealing with vector registers.
> Explain this to me and I'll convert myself.
You can write IR to memory (or load it as part of the process image), run the backend of the compiler on it, and dynamically link in the result by replacing a function pointer in a table with the address of the generated executable code. Implement it as a try/catch around the function call and initialize the function pointer in the table to null, and you get near-zero runtime overhead for subsequent calls. Allow the JIT compiler to inline these functions into each other and you can reduce the call overhead even further. Alternately, implement it as something that runs at program startup and eliminate the indirection entirely.
As to how it could be better:
If the dynamically generated code makes use of new hardware features not available at the time the original code was compiled....
If the dynamically generated code is better than the statically generated code because it has better algorithms for optimizing vector instructions...
If the compile time is negligible compared to the runtime of the resulting code...
If the dynamically generated code offloads work to the GPU through shared L3 cache...
If the dynamically generated code multithreads in addition to vectorizing the code and dynamically schedules it across all available cores...
Need I go on? I shall.
It doesn't make sense for vector intrinsics compiled by a C++ compiler to outperform Fortran compiled to vector instructions by the Fortran compiler. I know perfectly well that it is the same backend compiler, but the difference is that with C++ the compiler can't know which optimizations are safe, while the Fortran compiler can easily make them. As long as the embedded language allows the compiler to make the assumptions needed to enable optimizations that the C++ compiler can't, the code it generates will outperform the C++ compiler's even if it's the exact same backend compiler being used. There is practically no way to tell the C++ compiler which assumptions are safe: there just aren't sufficient pragmas and flags for that, and even if there were, they wouldn't be portable.
In any case, I don't really know much about this stuff; I encourage you to look at what's in there yourself. It could be that it's much less good than I imagine it to be based on my brief reading of the description. You could be a lot smarter than the guys from MIT who did it, who knows?
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk