Subject: Re: [boost] [Endian] Performance
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-09-07 15:03:56

Stewart, Robert wrote:
> Examples were given recently of swapping bytes in large data
> sets, where each extra cycle quickly accumulates into noticeable
> delays.

I've just tried this:

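#include <stdint.h>
#include <unistd.h>

static inline uint32_t byteswap(uint32_t src)
{
   return (src<<24)
        | ((src<<8) & 0x00ff0000)
        | ((src>>8) & 0x0000ff00)
        | (src>>24);
}

int main()
{
   uint32_t buf[1024];
   while (1) {
     ssize_t bytes = read(0,buf,4096);
     if (bytes <= 0) break;   // EOF or read error
     for (int i=0; i<bytes/4; ++i) buf[i] = byteswap(buf[i]);
     write(1,buf,bytes);
   }
}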

My first test file is a 600 MB video, which is small enough to fit into
the test system's RAM; these timings are for "warm" runs, i.e. the test
data is all cached. This is on the i.MX53 system I called "B" in my
last posts. Output is redirected to /dev/null.

1. With the byteswapping DISABLED, i.e. just the I/O:
real 0m2.495s
user 0m0.030s
sys 0m2.460s

2. With the byteswapping enabled, compiled with -O4:
real 0m3.019s
user 0m0.900s
sys 0m2.110s

In this case the core of the assembler is this nice simple loop:

        ldr r1, [r3, #0]
        rev r1, r1
        str r1, [r3], #4
        cmp r0, r3
        bne .L4
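To confirm that the shift-and-mask version really is a plain 32-bit byte reversal, it can be compared against the compiler builtin. This is only a minimal sketch, assuming GCC or Clang (where __builtin_bswap32 is available); byteswap_words and self_check are names made up for this illustration and are not part of the test program above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Shift-and-mask swap, as in the test program.
static inline std::uint32_t byteswap(std::uint32_t src)
{
    return (src << 24)
         | ((src << 8) & 0x00ff0000u)
         | ((src >> 8) & 0x0000ff00u)
         | (src >> 24);
}

// Hypothetical helper: swap every 32-bit word of a buffer in place.
static inline void byteswap_words(std::uint32_t* buf, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = byteswap(buf[i]);
}

// Hypothetical self-check: the hand-written swap should agree with the
// GCC/Clang builtin, and swapping twice should give back the input.
static inline void self_check()
{
    const std::uint32_t samples[] = { 0u, 1u, 0x11223344u, 0xdeadbeefu, 0xffffffffu };
    for (std::uint32_t x : samples) {
        assert(byteswap(x) == __builtin_bswap32(x));
        assert(byteswap(byteswap(x)) == x);
    }
}
```

Calling self_check() once at the start of main() would be enough; it costs nothing in the timed loop.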

3. With the byteswapping enabled, compiled with -O4 -funroll-loops:
real 0m2.516s
user 0m0.450s
sys 0m2.060s

(This code is much longer and less readable but it seems to be worthwhile.)

4. Then I tried this bytewise swapping. (Please tell me if you think
this code is wrong; I didn't check the output.)

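   char buf[4096];
   while (1) {
     ssize_t bytes = read(0,buf,4096);
     if (bytes <= 0) break;   // EOF or read error
     for (int i=0; i+3<bytes; i += 4) {
       char t = buf[i];   buf[i]   = buf[i+3]; buf[i+3] = t;
       char u = buf[i+1]; buf[i+1] = buf[i+2]; buf[i+2] = u;
     }
     write(1,buf,bytes);
   }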
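Since the output of the bytewise version was not checked, one way to check it is to compare it with the word-wise swap on the same data. The following is a sketch under stated assumptions, not the code that was timed: swap_bytewise, swap_word and bytewise_matches_wordwise are hypothetical names, and the bytewise loop is assumed to exchange bytes 0/3 and 1/2 of each 4-byte group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reverse each 4-byte group in place, one byte at a
// time. Trailing bytes that do not form a full group are left untouched.
static inline void swap_bytewise(unsigned char* buf, std::size_t bytes)
{
    for (std::size_t i = 0; i + 3 < bytes; i += 4) {
        unsigned char t = buf[i];     buf[i]     = buf[i + 3]; buf[i + 3] = t;
        unsigned char u = buf[i + 1]; buf[i + 1] = buf[i + 2]; buf[i + 2] = u;
    }
}

// Reference: the word-wise shift-and-mask swap.
static inline std::uint32_t swap_word(std::uint32_t src)
{
    return (src << 24)
         | ((src << 8) & 0x00ff0000u)
         | ((src >> 8) & 0x0000ff00u)
         | (src >> 24);
}

// Returns true if reversing the four bytes of x in memory gives the same
// value as the word-wise swap. This holds on either host byte order.
static inline bool bytewise_matches_wordwise(std::uint32_t x)
{
    unsigned char bytes[4];
    std::memcpy(bytes, &x, sizeof bytes);
    swap_bytewise(bytes, sizeof bytes);
    std::uint32_t y;
    std::memcpy(&y, bytes, sizeof y);
    assert(sizeof x == 4);
    return y == swap_word(x);
}
```

Running bytewise_matches_wordwise over a handful of values (or piping the same file through both programs and comparing the results) should show whether the two variants agree.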

Timing with -O4: (loop unrolling seems to make no improvement)
real 0m5.131s
user 0m3.180s
sys 0m1.940s

Increasing the block size doesn't make any significant difference;
reducing it below 4096 bytes does slow it down.

So the overhead of byteswapping compared to I/O - for a file cached in
memory - is between about 25% (case 3) and 150% (case 4) on this system.

I've also done a quick test with a file that is larger than RAM and so
must be read from the SSD. In this case the results are:

Case 4 above, bytewise swapping:
real 0m44.969s
user 0m15.910s
sys 0m12.800s

Case 3 above, rev instruction with loop unrolling:
real 0m44.624s
user 0m2.290s
sys 0m12.990s

So as expected the amount of CPU time used scales approximately as
before, but the elapsed time doesn't change as it's limited by the SATA
interface or SSD to around 50 MB/sec.

Personally I think these savings are worthwhile, and I believe that a
library developer should normally assume that potential users of a
library will have applications that need optimal performance, even if
the developer is happy with something more modest.

Regards, Phil.
