After fiddling around on Linux with clang 3.8 and gcc optimizer options,
I got it down to this (gcc with -O3):

Benchmark          Time(ns)    CPU(ns) Iterations
-------------------------------------------------
to_wire_xml           11174      11178     381818
to_wire_text           5148       5149     820313
to_wire_binary         3327       3330    1141304
to_wire_cstyle           63         63   65217391
from_wire_xml         27170      27183     155096
from_wire_text         5371       5370     783582
from_wire_binary       3226       3228    1296296
from_wire_cstyle         45         45   93750000

These results look very nice. <6µs to serialize/deserialize a structure
to a portable text archive seems very good :)

The difference compared to Windows is again pretty big, though ...


For what it's worth, in tests I've done in the past, binary serialization using Boost.Serialization and other similar systems showed a much smaller gap relative to memcpy: I was seeing maybe a 5x to 10x difference, whereas yours is around 50x.

Of course, this depends on a lot of factors, such as how much data is involved (which determines whether you are memory bound or not), but I am wondering if your cstyle tests are actually being completely optimized away. Have you examined the disassembly? If you find the code is being optimized away, Google Benchmark has a handy "benchmark::DoNotOptimize" function to keep the optimizer from throwing away the side effects of producing a particular value.

-- chris