From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2008-05-02 20:00:22


Steven wrote:
>Ok. There might be extra optimizations possible when the index is known
>at compile time as opposed to run time. (I'm not talking about the
>difference between get<X>(p) and p[X], here, but the difference between
>when X is known at compile time using templates to avoid code
>duplication vs. runtime and using function arguments). This is really a
>property of the algorithm rather than the point class, though.

I agree with you. It really taxes the compiler to optimize my highly
nested inline function calls and it has too much opportunity to give up
early instead of getting the job done. Switching from gcc 3.4.2 to gcc
4.2.0 resulted in about a 30% speedup in application code that relies
heavily on my types and algorithms. Compile times went up slightly too.
That tells me that the compiler is less than fully successful in
optimizing things. If the compiler is having trouble performing constant
propagation, we can't necessarily expect it to optimize away the overhead
of the compile-time accessor either, but at least it doesn't have the
option of giving up before instantiating the function template.
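
To make the distinction concrete, here is a minimal sketch of the same
algorithm written against a compile-time index and against a run-time
index. The names point2d, get, delta_static and delta_dynamic are
hypothetical and chosen only for illustration; they are not the actual
types or accessors from my library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2D point type, used only for illustration.
struct point2d {
    int coords[2];
    // Run-time index: the compiler has to propagate a constant argument
    // through the call chain before it can fold this access.
    int operator[](std::size_t i) const { return coords[i]; }
};

// Compile-time index: each value of I is a separate instantiation, so
// the index is a constant inside the function no matter how well the
// optimizer does.
template <std::size_t I>
inline int get(const point2d& p) {
    static_assert(I < 2, "index out of range");
    return p.coords[I];
}

// The same algorithm, parameterized on the index at compile time...
template <std::size_t I>
inline int delta_static(const point2d& a, const point2d& b) {
    return get<I>(b) - get<I>(a);
}

// ...and at run time, with the index passed as a function argument.
inline int delta_dynamic(const point2d& a, const point2d& b, std::size_t i) {
    return b[i] - a[i];
}
```

Both forms compute the same result; whether they end up as the same
machine code depends on how far the compiler gets with inlining and
constant propagation for the run-time version.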
 
On a related note, we recently confirmed that the 4.3.0 compiler (on
newer hardware) converts:

int myMax(int a, int b){ return a > b ? a : b;}

into:

.globl _Z5myMaxii
        .type _Z5myMaxii, @function
_Z5myMaxii:
.LFB2:
        .file 1 "t255.cc"
        .loc 1 7 0
.LVL0:
        .loc 1 7 0
        cmpl %edi, %esi
        cmovge %esi, %edi
.LVL1:
        .loc 1 10 0
        movl %edi, %eax
        ret

instead of:

.globl _Z5myMaxii
        .type _Z5myMaxii, @function
_Z5myMaxii:
.LFB2:
        .file 1 "t255.cc"
        .loc 1 7 0
        pushq %rbp
.LCFI0:
        movq %rsp, %rbp
.LCFI1:
        movl %edi, -4(%rbp)
        movl %esi, -8(%rbp)
        .loc 1 9 0
        movl -4(%rbp), %eax
        cmpl -8(%rbp), %eax
        jle .L2
        movl -4(%rbp), %eax
        movl %eax, -12(%rbp)
        jmp .L3
.L2:
        movl -8(%rbp), %eax
        movl %eax, -12(%rbp)
.L3:
        movl -12(%rbp), %eax
        .loc 1 10 0
        leave
        ret

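when compiling for an old processor or with an old compiler. That is
about 4X fewer instructions and NO BRANCH instructions. Note: cmovge is a
conditional-move instruction. The CMOVcc family has been available since
the Pentium Pro, so it is not new to the Core2 (Merom) processors, but
the newer compiler targeting them is where we saw it emitted. I have been
using the following: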

template <class T>
inline const T& predicated_value(const bool& pred, const T& a, const T& b) {
  const T* input[2] = {&b, &a};
  return *(input[pred]);
}

instead of the ?: operator because it was 35% faster than the
branch-based machine code the compiler generated when executed on the
Prescott-based hardware we had at the time. I'll be able to go back to
letting the compiler know best as soon as we cycle out the old hardware
and cycle in the new compiler.
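
For reference, this is roughly how myMax looks when written in terms of
predicated_value. The helper name my_max_predicated is hypothetical and
only for illustration. Whether this beats the ?: form depends entirely on
the compiler and the hardware, so treat it as a sketch to measure, not a
recommendation.

```cpp
#include <cassert>

// predicated_value as given in this message.
template <class T>
inline const T& predicated_value(const bool& pred, const T& a, const T& b) {
    // input[0] is b (selected when pred is false),
    // input[1] is a (selected when pred is true).
    const T* input[2] = {&b, &a};
    return *(input[pred]);
}

// Hypothetical helper: myMax expressed as an indexed load instead of
// the ?: operator.
inline int my_max_predicated(int a, int b) {
    return predicated_value(a > b, a, b);
}
```

The bool is converted to 0 or 1 when used as the array index, so the
selection is expressed as a load from a two-element table rather than as
a conditional in the source.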

Thanks,
Luke

