Subject: Re: [boost] [config] RFC PR 82
From: Domagoj Saric (dsaritz_at_[hidden])
Date: 2015-12-15 18:02:26


On Tue, 01 Dec 2015 16:40:33 +0530, Andrey Semashev
<andrey.semashev_at_[hidden]> wrote:

> On 2015-12-01 09:18, Domagoj Šarić wrote:
>> On Wed, 25 Nov 2015 01:50:40 +0530, Andrey Semashev
>> <andrey.semashev_at_[hidden]> wrote:
>>>
>>> As far as I understand, sealed can be used only with C++/CLR, is
>>> that right? If so then I'd rather not add a macro for it.
>>>
>>> If on the other hand sealed can be used equivalently to final in
>>> all contexts, then you could use it to implement BOOST_FINAL.
>>
>> It is available in C++ but technically it is not _the_ C++ keyword
>> but an extension with the same purpose so 'purists' might mind that
>> BOOST_NO_CXX11_FINAL is not defined even when BOOST_FINAL is
>> defined to sealed instead of final...
>
> Again, if sealed is equivalent to final in all contexts then I don't
> mind BOOST_FINAL expanding to sealed. Otherwise think of a separate
> macro for sealed.

The question was about BOOST_NO_CXX11_FINAL: "'purists' might mind that
BOOST_NO_CXX11_FINAL is not defined even when BOOST_FINAL is defined to
sealed instead of final"...
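
To make the question concrete, here is a minimal sketch of one possible arrangement. All EXAMPLE_* names are hypothetical stand-ins and not Boost.Config's actual definitions. It assumes the "no C++11 final" macro stays defined even when the convenience macro falls back to the vendor keyword:

```cpp
#include <cassert>

// Hypothetical names: EXAMPLE_NO_CXX11_FINAL, EXAMPLE_HAS_SEALED and
// EXAMPLE_FINAL are illustrative only, not Boost.Config's definitions.
#if !defined(EXAMPLE_NO_CXX11_FINAL)
#  define EXAMPLE_FINAL final      // the standard C++11 keyword is available
#elif defined(EXAMPLE_HAS_SEALED)
#  define EXAMPLE_FINAL sealed     // vendor extension with the same purpose
#else
#  define EXAMPLE_FINAL            // neither is available: expand to nothing
#endif

struct base {
    virtual ~base() {}
    virtual int id() const { return 1; }
};

// Both the class and the overrider are marked through the macro.
struct derived EXAMPLE_FINAL : base {
    int id() const EXAMPLE_FINAL { return 2; }
};

int call_through_base(base const* b) {
    assert(b != 0);
    return b->id();
}
```

In this arrangement the defect macro keeps describing the standard keyword only, while EXAMPLE_FINAL describes what the compiler can actually do.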

>>>>> I don't see the benefit of BOOST_NOTHROW_LITE.
>>>>
>>>> It's a nothrow attribute that does not insert runtime checks
>>>> to call std::terminate...and it is unfortunately not offered
>>>> by Boost.Config...
>>>
>>> Do you have measurements of the possible benefits compared to
>>> noexcept? I mean, noexcept was advertised as the more efficient
>>> version of throw() already.
>>
>> What more measurements beyond the disassembly window which clearly
>> shows unnecessary EH codegen (i.e. bloat) are necessary?
>
> I'll reiterate, what are the practical benefits? I don't care about a
> couple instructions there or not there - I will never see them in
> performance numbers or binary size.

I guess then you were also against noexcept with the same presumptive 'a
couple of instructions (compared to throw())' rationale?
What is the N in the "N * a-couple-instructions" expression at which you
start to care?
What kind of an argument is that anyway, i.e. why should anyone care
that you don't care? How does it relate to whether or not BOOST_NOTHROW
should be changed (or at least BOOST_NOTHROW_LITE added) to use the
nothrow attribute where available instead of noexcept (especially since
the macro itself cannot guarantee C++11 noexcept semantics anyway)?
You ask for practical benefits and then give a subjective/off the
cuff/question-begging reasoning for dismissing them...You may not mind
that the biggest library in Boost is a logging library of all things
while some on the other hand would like to see plain C finally retired
and C++ (and its standard library) be used (usable) in OS kernels[1] as
well as the tiniest devices from the darkest corners of the embedded
world. Some 'wise men' say the free lunch is over...

[1]
https://channel9.msdn.com/Events/Build/2014/9-015
An example discussion of exactly that - at ~0:17:00 they explicitly
mention drivers - I don't know about you but drivers coded with the "I
don't care about a couple instructions" mindset don't sound quite
exciting (even though most are already even worse than that, nVidia
display driver nvlddmkm.sys 12+MB, Realtek audio driver RTKVHD64.sys
4+MB...crazy...)
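
For readers unfamiliar with the distinction, a rough sketch of what such a macro could look like follows. EXAMPLE_NOTHROW_LITE is a hypothetical name, and the compiler checks are a simplified assumption rather than a complete configuration:

```cpp
#include <cassert>

// EXAMPLE_NOTHROW_LITE is a hypothetical name, not an existing Boost.Config macro.
#if defined(_MSC_VER)
#  define EXAMPLE_NOTHROW_LITE __declspec(nothrow)
#elif defined(__GNUC__) || defined(__clang__)
#  define EXAMPLE_NOTHROW_LITE __attribute__((nothrow))
#else
#  define EXAMPLE_NOTHROW_LITE   // unknown compiler: no annotation at all
#endif

// A promise to the compiler only: unlike noexcept, no std::terminate() call
// is mandated if the promise is broken, so throwing from here is a bug.
EXAMPLE_NOTHROW_LITE int clamp_to_range(int value, int lo, int hi);

int clamp_to_range(int value, int lo, int hi) {
    assert(lo <= hi);
    return value < lo ? lo : (value > hi ? hi : value);
}

// The C++11 spelling, for comparison: an exception escaping this function
// must end in std::terminate().
int clamp_to_range_noexcept(int value, int lo, int hi) noexcept {
    return clamp_to_range(value, lo, hi);
}
```

Whether the two spellings produce different code for a given call site depends on the compiler and its settings; comparing the disassembly of both is the way to check.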

>>>>> I don't think BOOST_OVERRIDABLE_SYMBOL is a good idea, given
>>>>> that the same effect can be achieved in pure C++.
>>>>
>>>> You mean creating a class template with a single dummy template
>>>> argument and a static data member just so that you can define a
>>>> global variable in a header w/o linker errors?
>>>
>>> Slightly better:
>>>
>>> template< typename T, typename Tag = void > struct singleton {
>>> static T instance; }; template< typename T, typename Tag > T
>>> singleton< T, Tag >::instance;
>>
>> That's what I meant...and it is really verbose (and slower to
>> compile than a compiler-specific attribute)...
>
> I won't argue about compilation speeds, although I doubt that the
> difference (in either favor) is measurable. As for verbosity, the
> above code needs to be written only once.

But what if you want a 'proper' name for the global variable? You have
to name the tag type and then create some inline function
named-like-the-desired variable that will return the
singleton<Tag>::instance...+ this does not work for static member
variables or functions...

All compilers are already forced to implement such an attribute
internally precisely to support code such as you wrote above - so this
just asks that this be standardized and made public....
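
The two routes can be put side by side in a short sketch. EXAMPLE_OVERRIDABLE_SYMBOL and the counter names are hypothetical, and the mapping to `selectany` and `weak` is an assumption about how such a macro might be spelled, not the PR's actual definition:

```cpp
#include <cassert>

// Route 1: the pure C++ template trick quoted above.
template <typename T, typename Tag = void>
struct singleton { static T instance; };
template <typename T, typename Tag>
T singleton<T, Tag>::instance;

// To get a 'proper' name, a tag type plus an accessor function is needed.
struct request_counter_tag {};
inline int& request_counter() {
    return singleton<int, request_counter_tag>::instance;
}

// Route 2: a compiler-specific attribute (hypothetical macro name).
#if defined(_MSC_VER)
#  define EXAMPLE_OVERRIDABLE_SYMBOL __declspec(selectany)
#elif defined(__GNUC__) || defined(__clang__)
#  define EXAMPLE_OVERRIDABLE_SYMBOL __attribute__((weak))
#else
#  define EXAMPLE_OVERRIDABLE_SYMBOL   // no equivalent known: single-TU use only
#endif

// A plain, directly named global that a header could define.
EXAMPLE_OVERRIDABLE_SYMBOL int error_counter = 0;

int bump_counters() {
    ++request_counter();
    ++error_counter;
    return request_counter() + error_counter;
}
```

The second route gives the variable its own name with no tag type or accessor function, at the price of a non-standard annotation.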

>>>>> Calling conventions macros are probably too specialized to
>>>>> functional libraries, I don't think there's much use for
>>>>> these. I would rather not have them in Boost.Config to avoid
>>>>> spreading their use to other Boost libraries.
>>>>
>>>> That's kind of self-contradicting, if there is a 'danger' of
>>>> them being used in other libraries that would imply there is a
>>>> 'danger' from them being useful...
>>>
>>> What I mean is that having these macros in Boost.Config might
>>> encourage people to use them where they would normally not.
>>
>> The same as above...I don't see a problem? If they are useful -
>> great, if not and people still use them - 'we have bigger
>> problems'...
>
> [snip]
>
>> There is no 'standard' calling convention, just the 'default'
>> one...and what headache can a non-default c.convention in an API
>> cause (e.g. the whole Win32 and NativeNT APIs use the non-default
>> stdcall convention)?
>
> By using non-default calling conventions you're forcing your users
> out of the standard C++ land. E.g. the user won't be able to store an
> address of your function without resorting to compiler-specific
> keywords or macros to specify the calling convention. It complicates
> integration of your library with other code. I'd rather strictly ban
> non-default calling conventions on API level at all.

* no compiler-specific keywords just a documented macro already used by
the API in question
* the macro is only needed if you need to declare the pointer/function
type yourself (instead of just passing the function address to an API,
using auto, decltype, lambdas or template type deduction or wrapping it
in something like std::function, signal/slot object...)
* explicit calling conventions in (cross platform) public APIs of
libraries and even OSs are a pretty common thing in my experience
* "forcing users out of the standard C++ land" - that's just moot i.e.
isn't that part of what Boost is about? i.e. there is nothing stopping
'us' from standardizing the concept of calling conventions (e.g. to
specify/handle the different architecture/ABI intricacies of 'evolving
hardware' - soft/hard float, different GPR file sizes, 'levels' of SIMD
units etc.)
* finally - you cannot just decide this from personal preference for all
people and all libraries, i.e. IMO that's up to individual libraries
(their devs and users) to decide - e.g. HPC, math, DSP, etc. libs should
be free to decide that the performance benefits of explicit/specialized
c.conventions outweigh the more than rare problem of a little more
verbose function pointer types...
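
A small sketch of the first two points follows. EXAMPLE_CALL is a hypothetical macro, and mapping it to `__fastcall` on 32-bit MSVC only is an arbitrary assumption made for illustration:

```cpp
#include <cassert>

// EXAMPLE_CALL is a hypothetical, documented macro an API might expose.
#if defined(_MSC_VER) && defined(_M_IX86)
#  define EXAMPLE_CALL __fastcall
#else
#  define EXAMPLE_CALL             // default convention everywhere else
#endif

int EXAMPLE_CALL scale(int value, int factor) { return value * factor; }

// The only place where a user has to spell the macro: declaring the
// pointer type by hand.
typedef int (EXAMPLE_CALL *scale_fn)(int, int);

int via_explicit_pointer() {
    scale_fn p = &scale;
    return p(3, 4);
}

// With auto or template type deduction the convention is carried along
// in the deduced type and never has to be written out.
template <typename F>
int apply(F f, int a, int b) { return f(a, b); }

int via_deduction() {
    auto p = &scale;
    return apply(p, 5, 6);
}
```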

>>> You might use them in library internals but there I think it's
>>> better to avoid the call at all - by forcing the hot code
>>> inline.
>>
>> Enter bloatware... A statically dispatched call to a 'near'
>> function has near zero overhead for any function with half-a-dozen
>> instructions _if_ it (i.e. the ABI/c.convention) does not force the
>> parameters to ping-pong through the stack... Forceinlining is just
>> a primitive bruteforce method in such cases...which eventually
>> makes things even worse (as this 'bloatware ignoring' way of
>> thinking is certainly a major factor why the dual-core 1GB RAM
>> netbook I'm typing on now slows down to a crawl from paging when I
>> open gmail and 3 more tabs...).
>
> There are different kinds of bloat. Force-inlining critical
> functions of your program will hardly make a significant difference
> on the total binary size, unless used unwisely or you're in hardcore
> embedded world where every byte counts.

This assumption can only be true if the 'critical functions of a
program' (force-inlined into every callsite!) comprise a non-significant
portion of the entire program...which is in direct contradiction with
presumptions you make elsewhere - such as that properly marking cold
portions of code is just not worth it...

Suddenly you are OK with "restricting users" (those in the 'hardcore
embedded world') as well as having/using keywords/macros (forceinline)
that can be used 'wisely' and 'unwisely'?...

+ sometimes, even with everything inlined, compilers still cannot
handle even simpler C++ abstractions 'all by themselves':
https://gist.github.com/rygorous/c6831e60f5366569d2e9

>> For dynamically dispatched calls (virtual functions) choosing the
>> appropriate c.convention and decorating the function with as many
>> relevant attributes is even more important (as the dynamic
>> dispatch is a firewall for the optimiser and it has to assume that
>> the function 'accesses&throws the whole universe')...
>
> My point was that one should avoid dynamic dispatch in hot code in
> the first place.

AFAICT I first mentioned dynamically dispatched calls.

> Otherwise you're healing a dead horse. Argument passing has little
> effect compared to a failure to predict the jump target.

Bald assertions (in addition to ignoring parts of what I said):
* take a look @
https://channel9.msdn.com/Events/GoingNative/2013/Compiler-Confidential
~19:15 on the mind-boggling mind-reading (branch prediction)
capabilities of modern CPUs

+ what I already said: it is not just about the direct speed impact but
about the detrimental impact on the optimisation of the code surrounding
the callsites (creating bigger and slower code)...some attributes (like
noalias, pure and const) can even allow a compiler to hoist a virtual
call outside a loop...
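
As an illustration of the last point, here is a sketch using the GCC/Clang `pure` attribute. EXAMPLE_PURE and the types are hypothetical, and whether a given compiler really hoists the call is not guaranteed; the attribute only makes it permissible:

```cpp
#include <cassert>

// EXAMPLE_PURE is a hypothetical macro; on other compilers it expands to nothing.
#if defined(__GNUC__) || defined(__clang__)
#  define EXAMPLE_PURE __attribute__((pure))
#else
#  define EXAMPLE_PURE
#endif

struct weight_source {
    virtual ~weight_source() {}
    // Promise: the result depends only on the arguments and on readable
    // memory, and the call has no side effects.
    EXAMPLE_PURE virtual int weight() const = 0;
};

struct fixed_weight : weight_source {
    explicit fixed_weight(int w) : w_(w) {}
    int weight() const { return w_; }
    int w_;
};

int weighted_sum(weight_source const& s, int const* values, int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += values[i] * s.weight();  // loop-invariant if the promise holds
    return total;
}
```

Without the annotation the optimiser has to assume that each `weight()` call may modify anything, including the `values` array.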

> Even when the target is known statically (i.e. non-virtual function
> call) the effect of the call can be significant if it's on the hot
> path - regardless of the calling convention.

A static call to a (cached/prefetched) function that does not touch the
stack has pretty much the overhead of two simple instructions CALL and
RET (and CPUs have had dedicated circuitry, RSBs, for exactly that for
ages).
Please give me an example of a function not automatically inlined (even
at Os levels) where this is a 'significant effect' (moreover even if you
could, that still wouldn't prove your point - all that is needed to
disprove it is the existence of a function whose call overhead is made
insignificant by using a better c.convention and appropriate attributes
- trivial)...

>>> If that code is unimportant then why do you care?
>>
>> Already explained above - precisely because it is unimportant it is
>> important that it be compiled for size (and possibly moved to the
>> 'cold' section of the binary) to minimise its impact on the
>> performance of the code that does matter; loading speed of the
>> binary; virtual memory; disk space, fragmentation and IO...
>
> I think, you're reaching here. Modern OSs don't 'load' binaries, but
> map them into address space. The pages are loaded on demand,

You don't say...obviously that's exactly what I meant. The pages have to
be loaded eventually (otherwise my computer would "just map the OS into
address space" when I turn it on ::roll eyes::) and if your binaries are
just lazily 'completely compiled for speed' (w/o PGO) then the
'important' and 'unimportant' parts of your code will be interspersed
(i.e. the 'unimportant' bits will have to be loaded along with the
'important' ones).

Especially true with today's compilers which additionally try to 'fix
lazy programming' with autovectorization, autoparallelization, loop
unrolling, transformation...this not only explodes codesize but also
slows down release builds - so properly marking 'unimportant' code has
the additional benefit of faster builds.

> and the typical page size is 4k - you'd have to save at least 4k of
> code to measure the difference, let alone feel it.

Let's try and see how hard it is to save something with Boost.Log:
- VS2015 U1
- combined sizes of both library DLLs
- bjam variant=release link=shared runtime-link=shared address-model=32
optimization=%1 cxxflags="/Ox%2 /GL /GS- /Gy"
where (%1 = space, %2 = s) and (%1 = speed, %2 = t) for for-size and
for-speed optimised builds, respectively:
* for-speed build: 959kB
* for-size build: 730kB
-> that's a delta of 229kB or 31% (the relative difference is much
larger if we compare only the text/code sections of the DLLs because of
all the strings, RTTI records etc...)
And according to your own assumption that hot code is insignificant in
size, it follows that you can shave off 30% from Boost.Log exactly by
having non-hot code compiled for size...

> Virtual address space is not an issue, unless you're on a 32-bit
> system, which is only widespread in the embedded area.

"Restricting user choices" and "reaching" perhaps? (ARMv7) mobile phone
owners should just shut up?
http://www.fool.com/investing/general/2013/01/19/mobile-overtakes-desktops-in-2013-microsoft-be-war.aspx
https://en.wikipedia.org/wiki/Usage_share_of_operating_systems
http://searchenginewatch.com/sew/opinion/2353616/mobile-now-exceeds-pc-the-biggest-shift-since-the-internet-began

> The disk space consumption by data exceeds code by magnitudes, which
> in turn shows on IO, memory and other related stuff.

If you say so, like for example:
* CMake 3.4.1 Win32 build:
- ~40MB total size
- of that ~24MB are binaries and the rest is mostly _documentation_
(i.e. not program data)
- cmake-gui.exe, a single dialog application
* Git4Windows 2.6.3 Win64 build:
- ~415MB (!?) total size
- of that ~345MB are binaries
* or things like Windows and Visual Studio which show even narrower
ratios but on the gigabyte scale...

So, when you wait for Windows, Visual Studio, Android Studio, Eclipse,
Adobe Acrobat, Photoshop.....to "map into memory" on i7 machines with
RAID0 and/or SSD drives that's because of "data"?

This practical application of the 'premature optimization is the root of
all evil' fallacy[1] came to one of its most (un)funny logical, reductio
ad absurdum, outcomes some years back when Adobe's Acrobat Reader became
so fat and slow that they decided they had to do something about it. Of
course, making it efficient was not an option (it would go against the
dogma) so they created the classical "fast loader daemon" (that holds
key parts of Acrobat Reader always in memory so that it would start
faster) - and the joke was that FoxitReader came to the scene, it could
read PDFs just like Acrobat Reader but was smaller than even the "fast
loader daemon" of the latter - i.e. it was what Acrobat Reader was
supposed to be "prematurely optimized" to from the beginning...

Or what about the move to SSDs, where you keep your programs on the
fast&expensive SSDs and "data" on conventional disks?

[1]
which should read 'premature optimization _for speed_ is the root of all
evil' (i.e. 'always optimise for size, and for speed only where and when
needed')

> And the net effect of these optimization attributes on a real
> program is yet to be seen.

So, all the compiler vendors added those (or the even more complex
things like PGO or App Thinning) simply because they had nothing better
to do?

No app is an island (except maybe desktop games:) - you may think that
it does not matter that you chose Qt (or some similar bloatware) to
implement your single-dialog application which mostly sits in the tray
because you noticed no delays "on modern hardware" - a user whose
startup time is prolonged by N seconds because his systray is filled by
a dozen such "oh it doesn't matter" utilities may feel quite differently...

>>> Simply organizing code into functions properly and using
>>> BOOST_LIKELY/UNLIKELY where needed will do the thing.
>>
>> No it will not (at least not w/o PGO)
>
> These hints don't require PGO; they work without it.

Neither did I say they do - merely that those hints will _not_ do the
thing (they simply do not serve that purpose - they are
function-internal hints) while "organizing code into functions properly"
_with_ PGO should do the thing (at least to a significant degree)...
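
What "function-internal" means can be seen in a short sketch. EXAMPLE_UNLIKELY is a hypothetical stand-in for a BOOST_UNLIKELY-style macro, built here on GCC/Clang's `__builtin_expect`:

```cpp
#include <cassert>

// EXAMPLE_UNLIKELY is a hypothetical stand-in for a BOOST_UNLIKELY-style macro.
#if defined(__GNUC__) || defined(__clang__)
#  define EXAMPLE_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define EXAMPLE_UNLIKELY(x) (x)
#endif

// The hint describes one branch inside this function. It says nothing
// about how often parse_digit() itself is called, or whether the function
// as a whole should be optimised for size or for speed.
int parse_digit(char c) {
    if (EXAMPLE_UNLIKELY(c < '0' || c > '9'))
        return -1;                       // rare path
    return c - '0';
}

int parse_two_digits(char const* s) {
    assert(s != 0);
    int const hi = parse_digit(s[0]);
    int const lo = parse_digit(s[1]);
    if (EXAMPLE_UNLIKELY(hi < 0 || lo < 0))
        return -1;
    return hi * 10 + lo;
}
```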

>> as the compiler cannot deduce these things (except for simple
>> scenarios like assuming all noreturn functions are cold)...and
>> saying that we can/should then help it with BOOST_LIKELY while
>> arguing that we shouldn't help it with
>> BOOST_COLD/MINSIZE/OPTIMIZE_FOR_* is 'beyond self
>> contradicting'...
>
> The difference is the amount of effort you have to put into it and
> the resulting portability and effect.

Huh? The effort is less, portability the same (macros?) and the effect
is better (simply because those are better tools for the job)??

> The other difference is in the amount of control the user has over
> the resulting code compilation. This important point you seem to
> disregard.

I already answered this point:
* from my perspective (and experience), libs that correctly 'mark' their
code give _me_ more freedom with _my_ code (and compiler options) w/o
fear of detrimental effect on their code
* I can't think of a situation where one would want to optimise cold
code for-speed - it would only make the whole thing run slower or same
at best
* only places where one would want even hot paths optimised for size are
things like 'ultrahardcore embedded targets' (where the constraints are
in kilobytes), bootloaders and/or 4kB/64kB demo competitions:
  - these are too specific to cripple all others because of them
  - usually already use custom solutions (even for the most basic things
like the C runtime)
  - the problem (if we can call it that and if it really exists) is only
with the hot/optimise-for-speed hints - and these can easily be made
'disableable' (i.e. redefined to nothing)
* finally, why this objection is bogus is that you could by the same
rationale ask that the user has to have control over how much effort a
lib dev puts into optimising various parts of the library...

>>> What I was saying is that it's the user who has to decide
>>> whether to build your code for size or for speed or for debug.
>>> That includes the parts of the code that you, the library
>>> author, consider performance critical or otherwise.
>>
>> I'm sorry I fail to take this as anything else than just pointless
>> nagging for the sake of nagging (and we are not talking about debug
>> builds here).
>
> I am talking about debug builds in particular. If I build a debug
> binary, I want to be able to step through every piece of code,
> including the ones you marked for speed. If I build for binary size,
> I want to minimize size of all code, including the one you marked. I
> don't care for speed in either of these cases.

Debug builds are a red herring - per function attributes like hot, cold,
optsize... do not affect debug builds or debug information. Pragmas that
mark whole blocks of code might affect them (depending on the compiler
and particular pragma/macro) but it is trivial to have those defined
empty in debug builds...
A similar thing holds for opt-for-size builds (besides all things
already said regarding that above): the per function attributes should
have no effect there.
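
A sketch of how such per-function hints could be kept 'disableable' follows. The EXAMPLE_* macros are hypothetical, and EXAMPLE_DISABLE_CODEGEN_HINTS is an assumed opt-out switch that a user or a debug configuration could define:

```cpp
#include <cassert>

// All EXAMPLE_* names are hypothetical. Defining
// EXAMPLE_DISABLE_CODEGEN_HINTS turns every hint into nothing.
#if !defined(EXAMPLE_DISABLE_CODEGEN_HINTS) && \
    (defined(__GNUC__) || defined(__clang__))
#  define EXAMPLE_HOT  __attribute__((hot))
#  define EXAMPLE_COLD __attribute__((cold))
#else
#  define EXAMPLE_HOT
#  define EXAMPLE_COLD
#endif

// Rarely executed diagnostics path: a candidate for size optimisation
// and for placement in a cold section.
EXAMPLE_COLD int count_invalid(int const* values, int n);

// Inner-loop work: a candidate for speed optimisation.
EXAMPLE_HOT int sum_valid(int const* values, int n);

int count_invalid(int const* values, int n) {
    assert(n >= 0);
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (values[i] < 0) ++bad;
    return bad;
}

int sum_valid(int const* values, int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 0; i < n; ++i)
        if (values[i] >= 0) total += values[i];
    return total;
}
```

The functions behave identically with or without the hints; only code generation and placement may differ.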

> For example, if your code relies on strict IEEE 754 you may want to
> mark the function with -fno-fast-math.

Isn't that exactly (part of) the argument I put out for the fastmath
macros?

>>> You may want to restrict his range of choices, e.g. when a
>>> certain optimization breaks your code.
>>
>> More strawman 'ivory towering'...how exactly am I restricting
>> anyone's choices? A real world example please?
>
> Read that quote again, please.
..
> Or if your library is broken with LTO on gcc older than 5.1 (like
> Boost.Log, for instance) you might want to add -fno-lto to your
> library build scripts.
..
> Thing is there are so many things that may potentially break the
> code, most of which you and I are simply unaware of that this kind of
> defensive practice just isn't practical.

That's a completely different story (compiler codegen bugs)...

-- 
C++ >In The Kernel< Now!
