
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2008-05-02 20:00:22

Steven wrote:
>Ok. There might be extra optimizations possible when the index is known
>at compile time as opposed to run time. (I'm not talking about the
>difference between get<X>(p) and p[X] here, but the difference between
>when X is known at compile time, using templates to avoid code
>duplication, vs. at run time, using function arguments.) This is really
>a property of the algorithm rather than the point class, though.

I agree with you. It really taxes the compiler to optimize my highly
nested inline function calls and it has too much opportunity to give up
early instead of getting the job done. Switching from gcc 3.4.2 to gcc
4.2.0 resulted in about a 30% speedup in application code that relies
heavily on my types and algorithms. Compile times went up slightly too.
That tells me that the compiler is less than fully successful in
optimizing things. If the compiler is having trouble performing constant
propagation, we can't necessarily expect it to optimize away the overhead
of the compile-time accessor either, but at least it doesn't have the
option of giving up before instantiating the template function.
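
To make the distinction concrete, here is a minimal sketch. The point2d type and the get, get_at, axis_distance, and axis_distance_rt names are hypothetical, invented for this illustration, and are not taken from the library under discussion. The template versions fix the index at instantiation time; the run-time versions depend on inlining and constant propagation to reach the same code.

```cpp
#include <cassert>

// Hypothetical 2D point type, for illustration only.
struct point2d {
    int coords[2];
};

// Compile-time index: I is a template parameter, so each instantiation
// refers to one fixed element and no index is passed at run time.
template <int I>
inline int get(const point2d& p) {
    return p.coords[I];
}

// Run-time index: the same access with the index as a function argument.
inline int get_at(const point2d& p, int i) {
    return p.coords[i];
}

// Algorithm written once and parameterized on the axis at compile time.
template <int Axis>
inline int axis_distance(const point2d& a, const point2d& b) {
    int d = get<Axis>(a) - get<Axis>(b);
    return d < 0 ? -d : d;
}

// Run-time equivalent: specializing it for a constant axis is left to
// the optimizer (inlining plus constant propagation).
inline int axis_distance_rt(const point2d& a, const point2d& b, int axis) {
    int d = get_at(a, axis) - get_at(b, axis);
    return d < 0 ? -d : d;
}
```

Both forms avoid duplicating the algorithm per axis. Whether they compile to the same machine code depends on the compiler and optimization level, which is the point made above.
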
On a related note, we recently confirmed that the 4.3.0 compiler (on
newer hardware) converts:

int myMax(int a, int b) { return a > b ? a : b; }

into:

.globl _Z5myMaxii
        .type _Z5myMaxii, @function
_Z5myMaxii:
        .file 1 ""
        .loc 1 7 0
        .loc 1 7 0
        cmpl %edi, %esi
        cmovge %esi, %edi
        .loc 1 10 0
        movl %edi, %eax
        ret

instead of:

.globl _Z5myMaxii
        .type _Z5myMaxii, @function
_Z5myMaxii:
        .file 1 ""
        .loc 1 7 0
        pushq %rbp
        movq %rsp, %rbp
        movl %edi, -4(%rbp)
        movl %esi, -8(%rbp)
        .loc 1 9 0
        movl -4(%rbp), %eax
        cmpl -8(%rbp), %eax
        jle .L2
        movl -4(%rbp), %eax
        movl %eax, -12(%rbp)
        jmp .L3
.L2:
        movl -8(%rbp), %eax
        movl %eax, -12(%rbp)
.L3:
        movl -12(%rbp), %eax
        .loc 1 10 0
        leave
        ret

when compiling for an old processor or with an old compiler. That is about
4X fewer instructions and NO BRANCH instructions. Note: cmovge is a
conditional move instruction that the compiler emits here when targeting
the Core2 (merom) processors. I have been using the following:

template <class T>
inline const T& predicated_value(const bool& pred, const T& a, const T& b) {
  // Select by indexing with the predicate (false -> 0, true -> 1), no branch.
  const T* input[2] = {&b, &a};
  return *(input[pred]);
}

instead of the ?: syntax because it was 35% faster than the branch-based
machine code the compiler generated when executed on the Prescott-based
hardware we had at the time. I'll be able to go back to letting the
compiler know best as soon as we cycle out the old hardware and cycle in
the new hardware.
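
As a minimal sketch of how the two forms line up, the snippet below writes the max function both ways. The names myMaxTernary and myMaxPredicated are hypothetical wrappers added for this comparison only, and the predicate is taken by value here for brevity. The sketch only shows that the two forms return the same results; which one is faster depends on the compiler and the target processor.

```cpp
#include <cassert>

// Table-lookup selection as described above: index a two-element array
// with the predicate (false -> 0, true -> 1) instead of branching.
template <class T>
inline const T& predicated_value(bool pred, const T& a, const T& b) {
    const T* input[2] = {&b, &a};
    return *(input[pred]);
}

// Hypothetical wrapper: the plain conditional-operator form.
inline int myMaxTernary(int a, int b) { return a > b ? a : b; }

// Hypothetical wrapper: the same selection through predicated_value.
inline int myMaxPredicated(int a, int b) {
    return predicated_value(a > b, a, b);
}
```
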

