Subject: Re: [boost] [Booster] Or boost is useless for library developers
From: Lassi Tuura (lat_at_[hidden])
Date: 2010-05-20 13:23:55
>> - Inline functions is best way to improve performance.
> I've found this to be true in my own work. So have lots of other people. Prove us wrong.
Let me have a go at describing some bloat. It's not in my interest to prove anything. I'll just tell you the cost we see in real-life applications.
On Linux, GCC 4.5.x, x86_64, we have executables which load 598 DSO images mapped in 611 memory regions, corresponding to:
- 263'431'108 100.0% bytes mapped from shared libraries
- 261'638'634 99.3% bytes total allocated to sections
- 1'792'474 0.7% bytes padding not in any section (= rounding to page)
The breakdown by section, counting only sections that are actually loaded into memory:
- 114'086'965 43.3% code (.text)
- 65'205'187 24.8% dynamic symbols and related tables
- 26'825'486 10.2% unwind tables
- 24'356'624 9.2% plt + relocations + related tables
- 19'000'712 7.2% global data
- 11'325'928 4.3% global common data (.bss)
- 749'576 0.3% various shared library headers
- 82'138 0.0% global constructors and destructors
- 5'890 0.0% glibc memory management voodoo
- 128 0.0% thread-specific data
That's ~55% "real stuff", ~25% symbol tables, ~10% unwind tables, ~10% relocations and PLT. The application's virtual memory size is about a gigabyte, so this is a major fraction of the overall footprint.
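For anyone who wants to check the rounding, here is a small sketch that recomputes those shares from the byte counts listed above. The constant and function names are made up for illustration, and grouping code, global data and .bss together as "real stuff" is my reading of the ~55% figure, not something stated explicitly.

```cpp
#include <cstdint>

// Byte counts copied from the section breakdown above.
constexpr std::uint64_t kTotalMapped  = 263431108;  // all bytes mapped from shared libraries
constexpr std::uint64_t kText         = 114086965;  // code (.text)
constexpr std::uint64_t kSymbolTables = 65205187;   // dynamic symbols and related tables
constexpr std::uint64_t kUnwind       = 26825486;   // unwind tables
constexpr std::uint64_t kPltReloc     = 24356624;   // plt + relocations + related tables
constexpr std::uint64_t kGlobalData   = 19000712;   // global data
constexpr std::uint64_t kBss          = 11325928;   // global common data (.bss)

// Share of the total mapped bytes in tenths of a percent, rounded to nearest.
constexpr std::uint64_t tenths_of_percent(std::uint64_t bytes) {
    return (bytes * 1000 + kTotalMapped / 2) / kTotalMapped;
}

// Assumption: "real stuff" means code + global data + .bss.
constexpr std::uint64_t kRealStuff = kText + kGlobalData + kBss;

static_assert(tenths_of_percent(kRealStuff) == 548, "~55% real stuff");
static_assert(tenths_of_percent(kSymbolTables) == 248, "~25% symbol tables");
static_assert(tenths_of_percent(kUnwind) == 102, "~10% unwind tables");
static_assert(tenths_of_percent(kPltReloc) == 92, "~10% relocations and PLT");
```

The checks run at compile time, so the file only needs to compile (C++11 or later) for the figures to be confirmed.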
There are 544'533 symbols, which represent 142'548'100 bytes. Of these, 272'190 are weak symbols, accounting for 43'565'063 bytes.
A significant fraction of those weak symbols represent template duplication across libraries, but that's not the only form of bloat we see. There are 2'599 symbols with at least 10 duplicates, totalling 5'832'419 bytes, and 118 vtables with at least 10 duplicates (about 300k).
So over half of the symbols and about a third of the size are ill-advisedly generated inline functions, virtual function tables (19'043 vtables = 2'802'928 bytes) and type info objects and names (45'961 typeinfo objs + names = 3'142'851 bytes). All of this comes with accompanying symbol tables, PLTs, unwind tables, and so on.
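To make the template duplication concrete, here is a minimal sketch of one common way to avoid it: declare the instantiation `extern` in the header and emit it in exactly one source file. The function `sum` and the file names in the comments are hypothetical, and `extern template` is a C++0x feature (long available as a GCC extension), so treat this as an illustration and not as a description of our code base.

```cpp
#include <cstddef>
#include <vector>

// ---- would live in a header, e.g. a hypothetical "sum.h" ----
template <typename T>
T sum(const std::vector<T>& values) {
    T total = T();
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}

// Explicit instantiation declaration: translation units that include the
// header should not emit their own weak copy of sum<double>; they refer
// to the single instantiation defined elsewhere.
extern template double sum<double>(const std::vector<double>&);

// ---- would live in exactly one source file, e.g. a hypothetical "sum.cpp" ----
// Explicit instantiation definition: the one place the code is generated.
template double sum<double>(const std::vector<double>&);
```

Without the `extern template` line, every library that calls `sum<double>` would normally carry its own weak copy, along with its symbol table entry and unwind information.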
A significant fraction of the 60+ MB symbol tables is obviously for long mangled names.
There is a very significant number of tiny (5-7 byte) functions which are actually just stubs that call the real function via the PLT. If function A calls B, we actually have A call an out-of-line stub function, which calls a PLT entry, which jumps to B. And this is mostly for inline functions that should never have been inline in the first place!
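As an illustration of why the PLT hop appears at all: a function with default visibility in a shared library can be interposed at load time, so calls to it are routed through the PLT. The sketch below uses GCC's visibility attribute to mark an internal helper as hidden, which lets calls within the same DSO bind directly and keeps the helper out of the dynamic symbol table. The macro and function names are hypothetical, and whether this would help for any particular library is an assumption to be measured, not a claim.

```cpp
// GCC-style visibility attributes; the macros expand to nothing elsewhere.
#if defined(__GNUC__)
#  define LIB_LOCAL  __attribute__((visibility("hidden")))
#  define LIB_EXPORT __attribute__((visibility("default")))
#else
#  define LIB_LOCAL
#  define LIB_EXPORT
#endif

// Internal helper: hidden, so it gets no dynamic symbol table entry and
// calls from inside the same shared library do not need to go via the PLT.
LIB_LOCAL int scale(int x);

// Public entry point: stays in the dynamic symbol table for clients.
LIB_EXPORT int public_entry(int x);

int scale(int x) { return 3 * x; }

int public_entry(int x) {
    // Direct call when built into a shared library, because scale() is hidden.
    return scale(x) + 1;
}
```

GCC also offers `-fvisibility=hidden` and `-fvisibility-inlines-hidden` to apply the same idea across a whole library without annotating every function.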
Unsurprisingly, we see a lot of CPU stalls linked to instruction prefetch issues. We used to see massive-scale TLB failures as well (over 60% of L2 cache accesses were for code some time back), but with wider TLBs there's less of that. Intel's performance experts have told us directly that the amount of code we have is a real challenge, and that reducing the amount of code and improving code locality are likely required to improve performance.
I am certain there are projects where "headers only" is nice. There are also projects where writing the entire package into the header files, on the assumption that the compiler or build system will sort out where to put the code, generates massive additional costs.
(Boost is a contributor to the above bloat, but not the top one.)