Subject: Re: [boost] [Endian] Performance
From: Adder (adder.thief_at_[hidden])
Date: 2011-09-05 23:05:41


Dear All,

I have humbly taken a look at the "conversion.hpp" section
of Beman Dawes' library and I was able to learn a lot... Thank you...

I was able to learn a lot from Tymofey's and Phil Endecott's code too,
so thank you too !

I am just a programmer-wannabe... I do hope that my message
is not understood as disrespect, for it is not meant as such, not at all.

I have conducted a small and non-representative benchmark,
the ugly source code of which I have uploaded here:

  http://preview.tinyurl.com/4yhcc8t

It compares several versions of code meant to "bswap"
16-bit, 32-bit and 64-bit integer values.
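For readers who have not met the operation before, here is a minimal sketch of a portable "bswap" built only from shifts and masks. It is not the code of any of the versions compared below; the names swap16, swap32, swap64 and check_swaps are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Portable byte reversal using only shifts and masks.
// All names here are illustrative, not taken from the benchmarked libraries.
inline std::uint16_t swap16(std::uint16_t v)
{
    // v is promoted to int before shifting, so cast the result back down.
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

inline std::uint32_t swap32(std::uint32_t v)
{
    return  (v >> 24)
         | ((v >>  8) & 0x0000FF00u)
         | ((v <<  8) & 0x00FF0000u)
         |  (v << 24);
}

inline std::uint64_t swap64(std::uint64_t v)
{
    // Reverse each 32-bit half, then exchange the two halves.
    const std::uint32_t lo = swap32(static_cast<std::uint32_t>(v));
    const std::uint32_t hi = swap32(static_cast<std::uint32_t>(v >> 32));
    return (static_cast<std::uint64_t>(lo) << 32) | hi;
}

// Self-check against known values; swapping twice must be the identity.
inline void check_swaps()
{
    assert(swap16(0x1234u) == 0x3412u);
    assert(swap32(0x12345678u) == 0x78563412u);
    assert(swap64(0x0102030405060708ull) == 0x0807060504030201ull);
    assert(swap32(swap32(0xDEADBEEFu)) == 0xDEADBEEFu);
}
```

How well a compiler turns such code into a single instruction is exactly the kind of thing that varies from one compiler to another.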

In the table below, I have used the following notations:

  RC-1:
    This is endian::revert from the release-candidate
    (non-final, I think) version of Beman Dawes' library.

  Tymofey:
    A combination of my own unworthy hack for uint16_t,
    Phil Endecott's version for uint32_t (for USE_TYMOFEY)
    and Tymofey's suggestion for uint64_t.

  Imbecil-0:
    Boost-wannabe.RawMemory compiled with BOOST_RAW_MEMORY_TRICKS
    #define'd as 0.

  Imbecil-1:
    Boost-wannabe.RawMemory compiled with BOOST_RAW_MEMORY_TRICKS
    #define'd as 1 (the default).

Here are the approximate results,
on the same computer that is described here:

  http://adder.iworks.ro/Boost/RawMemory/#Benchmarks
  (Ctrl+F: "And the name of the computer is").

(Much to my shame, I have not employed Windows' "high-performance counter".
Also, I did not run too many repetitions in order to avoid overheating
the lovely computer or the benchmarks turning against me.)
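As an aside on timing: the sketch below shows one way such a loop could be timed with a monotonic clock. It uses std::chrono::steady_clock from C++11 rather than the Windows counter mentioned above, and it is not the harness behind the numbers in this message; time_loop_ms is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: applies `op` to `count` successive values and
// returns the elapsed wall-clock time in milliseconds.
template <typename Op>
double time_loop_ms(Op op, std::uint32_t count)
{
    assert(count > 0);

    // Writing to a volatile keeps the optimizer from discarding the loop.
    volatile std::uint32_t sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < count; ++i)
        sink = sink ^ op(i);
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A caller would pass the function under test, for example a lambda wrapping a byte swap, and compare the returned times across several runs.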

(For the 64-bit integers -- what a devious choice ! --
I have also noted the approximate number of machine code bytes
that were generated.)

Borland C++Builder 5.5:

  uint16_t
    RC-1 1765
    Tymofey 265
    Imbecil-0 250
    Imbecil-1 250

  uint32_t
    RC-1 1922
    Tymofey 453
    Imbecil-0 469
    Imbecil-1 453

  uint64_t
    RC-1 2360 ( 72 bytes of code)
    Tymofey 3375 (225 bytes of code)
    Imbecil-0 2234 (169 bytes of code)
    Imbecil-1 797 (7 bytes for the caller, 15 bytes for the callee)

Digital Mars C++:

  uint16_t
    RC-1 360
    Tymofey 250
    Imbecil-0 265
    Imbecil-1 250

  uint32_t
    RC-1 1828
    Tymofey 391
    Imbecil-0 1203
    Imbecil-1 453 <-- I am such a noob !

  uint64_t
    RC-1 2797 ( 84 bytes of code)
    Tymofey 2875 (292 bytes of code)
    Imbecil-0 2453 (202 bytes of code)
    Imbecil-1 609 (7 bytes for the caller, 15 bytes for the callee)

GCC 4.3.4 (20090804):

  uint16_t
    RC-1 1750
    Tymofey 188
    Imbecil-0 187
    Imbecil-1 187 <-- Of course, I chose the run that favours me !

  uint32_t
    RC-1 1328
    Tymofey 453
    Imbecil-0 438
    Imbecil-1 188

  uint64_t
    RC-1 2781 ( 67 bytes of code)
    Tymofey 1578 ( 96 bytes of code)
    Imbecil-0 578 ( 46 bytes of code)
    Imbecil-1 250 ( 8 bytes of code)

Visual C++ 2003:

  uint16_t
    RC-1 984
    Tymofey 187
    Imbecil-0 125
    Imbecil-1 203 <-- I have to investigate this !

  uint32_t
    RC-1 1563
    Tymofey 375
    Imbecil-0 515
    Imbecil-1 125

  uint64_t
    RC-1 2328 ( 63 bytes of code)
    Tymofey 3047 (154 bytes of code)
    Imbecil-0 1110 (122 bytes of code)
    Imbecil-1 438 ( 12 bytes of code,
                        but I had to "WorkAround.cpp"
                        an Internal Compiler Error
                        by preventing the function from being inlined;
                        I have to investigate this !)

Visual C++ 2005:

  uint16_t
    RC-1 906
    Tymofey 188
    Imbecil-0 187
    Imbecil-1 203

  uint32_t
    RC-1 1906
    Tymofey 391
    Imbecil-0 328
    Imbecil-1 125

  uint64_t
    RC-1 3437 ( 65 bytes of code)
    Tymofey 3047 (151 bytes of code)
    Imbecil-0 641 ( 61 bytes of code)
    Imbecil-1 188 ( 8 bytes of code)

Visual C++ 2005 for x64:

  uint16_t
    RC-1 1687
    Tymofey 203
    Imbecil-0 188
    Imbecil-1 203

  uint32_t
    RC-1 1390
    Tymofey 328
    Imbecil-0 329
    Imbecil-1 188

  uint64_t
    RC-1 2406 ( 72 bytes of code)
    Tymofey 703 (210 bytes of code; the loop was unrolled 1:2)
    Imbecil-0 531 (162 bytes of code; the loop was unrolled 1:2)
    Imbecil-1 187 ( 3 bytes of code)

The "Tymofey" optimizations are included in the previous emails
from Tymofey and Phil Endecott.

The "Imbecil" optimizations (and pessimizations) are described and
included here:

  http://adder.iworks.ro/Boost/RawMemory

If Assembler, compiler intrinsics, __fastcall,
compiler-specific tuning, etc. sound interesting, you are welcome to have
a closer look.
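To give an idea of what "compiler intrinsics" can look like here, a minimal sketch follows. It assumes the MSVC intrinsics _byteswap_ulong / _byteswap_uint64 and the GCC builtins __builtin_bswap32 / __builtin_bswap64, with a shift-and-mask fallback elsewhere. It is not the RawMemory code; the wrapper names bswap32, bswap64 and check_bswap are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#  include <stdlib.h>   // _byteswap_ulong, _byteswap_uint64
#endif

// Hypothetical wrappers that pick a compiler intrinsic when one is known,
// and fall back to shifts and masks otherwise.
inline std::uint32_t bswap32(std::uint32_t v)
{
#if defined(_MSC_VER)
    return _byteswap_ulong(v);
#elif defined(__GNUC__)
    return __builtin_bswap32(v);
#else
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
#endif
}

inline std::uint64_t bswap64(std::uint64_t v)
{
#if defined(_MSC_VER)
    return _byteswap_uint64(v);
#elif defined(__GNUC__)
    return __builtin_bswap64(v);
#else
    // Reverse each 32-bit half, then exchange the two halves.
    const std::uint32_t lo = bswap32(static_cast<std::uint32_t>(v));
    const std::uint32_t hi = bswap32(static_cast<std::uint32_t>(v >> 32));
    return (static_cast<std::uint64_t>(lo) << 32) | hi;
#endif
}

// Self-check against known values.
inline void check_bswap()
{
    assert(bswap32(0x12345678u) == 0x78563412u);
    assert(bswap64(0x0102030405060708ull) == 0x0807060504030201ull);
}
```

Whether an intrinsic beats the plain version depends on the compiler and its switches, so any such choice needs to be measured per compiler.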

Thank you for your time and for your work...

--
Yours truly,
Adder

On 9/6/11, Beman Dawes <bdawes_at_[hidden]> wrote:
> On Mon, Sep 5, 2011 at 6:38 PM, Phil Endecott
> <spam_from_boost_dev_at_[hidden]> wrote:
>> I've just done some quick benchmarks of Beman's proposed byte-swapping
>> code...
>>
>> What do people see on other platforms?
>
> Twenty plus years ago I put a lot of effort into finding optimum code
> for a C language endian library, but real-world application tests
> showed that what was optimum for one compiler was a dog on another
> compiler, that compiler switches could change what was optimum code,
> and then for the next release of the compiler we had to do it all over
> again.
>
> That said, if we can come up with a benchmark representative of
> real-world use cases, and can come up with robust optimizations that
> have some staying power, I'll gladly include them in the code.
>
> --Beman
