Subject: Re: [boost] [config] RFC PR 82
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2015-12-16 00:31:29


On Wednesday, December 16, 2015 12:02:26 AM Domagoj Saric wrote:
> On Tue, 01 Dec 2015 16:40:33 +0530, Andrey Semashev
>
> The question was about BOOST_NO_CXX11_FINAL: "'purists' might mind that
> BOOST_NO_CXX11_FINAL is not defined even when BOOST_FINAL is defined to
> sealed instead of final"...

I'm not seeing those 'purists' in this discussion, and I'm not getting the
impression that you're one of them. As far as I'm concerned, it doesn't matter
how BOOST_FINAL is implemented as long as it conforms to the documented
behavior (which, I assume, would be equivalent to the behavior of C++11
final).
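For reference, a minimal sketch of what the proposed macro might look like
(BOOST_FINAL is the name from the PR; 'sealed' is MSVC's non-standard
pre-C++11 extension; the empty fallback is an assumption):

    #if !defined(BOOST_NO_CXX11_FINAL)
    #   define BOOST_FINAL final
    #elif defined(_MSC_VER)
    #   define BOOST_FINAL sealed  // MSVC-specific extension, predates C++11 'final'
    #else
    #   define BOOST_FINAL         // no-op: the 'final' guarantee is simply dropped
    #endif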

> >>>>> I don't see the benefit of BOOST_NOTHROW_LITE.
> >>>>
> >>>> It's a nothrow attribute that does not insert runtime checks
> >>>> to call std::terminate...and it is unfortunately not offered
> >>>> by Boost.Config...
> >>>
> >>> Do you have measurements of the possible benefits compared to
> >>> noexcept? I mean, noexcept was advertised as the more efficient
> >>> version of throw() already.
> >>
> >> What more measurements beyond the disassembly window which clearly
> >> shows unnecessary EH codegen (i.e. bloat) are necessary?
> >
> > I'll reiterate, what are the practical benefits? I don't care about a
> > couple instructions there or not there - I will never see them in
> > performance numbers or binary size.
>
> I guess then you were also against noexcept with the same presumptive 'a
> couple of instructions (compared to throw())' rationale?
> What is the N in the "N * a-couple-instructions" expression at which you
> start to care?

That's just handwaving and I was interested in some practical evidence that
BOOST_NOTHROW_LITE would be beneficial compared to noexcept. You haven't
presented any so far.
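For context, here is roughly what such a macro would presumably expand to,
based on the vendor extensions named earlier in the thread (a sketch; the
empty fallback and the example function are illustrative):

    #if defined(_MSC_VER)
    #   define BOOST_NOTHROW_LITE __declspec(nothrow)
    #elif defined(__GNUC__)
    #   define BOOST_NOTHROW_LITE __attribute__((nothrow))
    #else
    #   define BOOST_NOTHROW_LITE  // no known equivalent - expand to nothing
    #endif

    // Usage: declare a non-throwing function without the std::terminate
    // check that C++11 noexcept implies:
    BOOST_NOTHROW_LITE void fast_path(int* p);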

> What kind of an argument is that anyway, i.e. why should anyone care
> that you don't care?

Well, you asked for community opinion and I expressed mine. If you're not
interested in it then say so and I won't waste everyone's time. This remark
relates to the tone of the rest of your reply.

> How does it relate to whether or not BOOST_NOTHROW
> should be changed (or at least BOOST_NOTHROW_LITE added) to use the
> nothrow attribute where available instead of noexcept (especially since
> the macro itself cannot guarantee C++11 noexcept semantics anyway)?

First, there is no BOOST_NOTHROW macro currently. There is BOOST_NOEXCEPT, and
it does correspond to C++11 noexcept when it is supported by the compiler. It
does not emulate noexcept semantics in C++03, but that was never implied.

Second, there is BOOST_NOEXCEPT_OR_NOTHROW, and it does switch between throw()
in C++03 and noexcept in C++11. I'm not sure if this is the macro you mean by
BOOST_NOTHROW, but I don't think changing it to __attribute__((nothrow)) is a
good idea, because that would be a breaking change (both in behavior and
compilation).
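For reference, this is essentially how Boost.Config defines these macros
(simplified):

    #if defined(BOOST_NO_CXX11_NOEXCEPT)
    #   define BOOST_NOEXCEPT                     // expands to nothing in C++03
    #   define BOOST_NOEXCEPT_OR_NOTHROW throw()  // dynamic exception spec in C++03
    #else
    #   define BOOST_NOEXCEPT noexcept
    #   define BOOST_NOEXCEPT_OR_NOTHROW noexcept
    #endif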

Third, I did not propose to change the semantics of the existing macros, nor
did I comment on that idea. I'm sceptical about introducing a new one,
BOOST_NOTHROW_LITE.

> You ask for practical benefits and then give a subjective/off-the-cuff/
> question-begging reasoning for dismissing them... You may not mind
> that the biggest library in Boost is a logging library of all things,
> while some on the other hand would like to see plain C finally retired
> and C++ (and its standard library) be used (usable) in OS kernels[1] as
> well as the tiniest devices from the darkest corners of the embedded world.
> Some 'wise men' say the free lunch is over...

I'm not sure what you're getting at here.

> [1]
> https://channel9.msdn.com/Events/Build/2014/9-015
> An example discussion of exactly that - at ~0:17:00 they explicitly
> mention drivers - I don't know about you but drivers coded with the "I
> don't care about a couple instructions" mindset don't sound quite
> exciting (even though most are already even worse than that, nVidia
> display driver nvlddmkm.sys 12+MB, Realtek audio driver RTKVHD64.sys
> 4+MB...crazy...)

Given the complexity of modern hardware, I don't find these sizes crazy at
all. That said, I have no experience in driver development, so maybe some of
the craziness is lost on me. And by the way, I don't think those sizes are
relevant anyway, as drivers are most likely written in C and not C++ - not in
C++ with exceptions, anyway.

> >>>>> I don't think BOOST_OVERRIDABLE_SYMBOL is a good idea, given
> >>>>> that the same effect can be achieved in pure C++.

[snip]

> But what if you want a 'proper' name for the global variable? You have
> to name the tag type and then create some inline function
> named-like-the-desired variable that will return the
> singleton<Tag>::instance...

I don't see a problem with that.

> + this does not work for static member
> variables or functions...

inline works for functions.
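To spell out the pure C++ idiom under discussion (a sketch; singleton,
my_config and global_config are illustrative names): the static data member of
a class template is implicitly instantiated once per type and merged across
translation units, so it can live in a header, and an inline accessor gives it
a 'proper' name:

    template< typename T >
    struct singleton
    {
        static T instance;
    };
    template< typename T >
    T singleton< T >::instance;  // one object per T across all TUs

    struct my_config { int verbosity; };

    inline my_config& global_config()  // the named accessor mentioned above
    {
        return singleton< my_config >::instance;
    }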

> All compilers are already forced to implement such an attribute
> internally, precisely to support code such as you wrote above - so this
> just asks that it be standardized and made public....

They have to implement it internally, but not necessarily make it public. You
will have to write the portable code anyway, so what's the point in compiler-
specific versions?

> > By using non-default calling conventions you're forcing your users
> > out of the standard C++ land. E.g. the user won't be able to store an
> > address of your function without resorting to compiler-specific
> > keywords or macros to specify the calling convention. It complicates
> > integration of your library with other code. I'd rather strictly ban
> > non-default calling conventions on API level at all.
>
> * no compiler-specific keywords just a documented macro already used by
> the API in question
> * the macro is only needed if you need to declare the pointer/function
> type yourself (instead of just passing the function address to an API,
> using auto, decltype, lambdas or template type deduction or wrapping it
> in something like std::function, signal/slot object...)

Not only that. As the calling convention affects the function type, it also
affects template specialization matching and overload resolution. Have a look
at the boost::mem_fn implementation for an example of the overload explosion
caused by that. Thanks but no.
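A small illustration of that point (MSVC/32-bit x86 syntax, hypothetical
names; on x64 these conventions collapse into one, which is its own source of
#ifdef trouble): the calling convention is part of the function type, so
generic code has to enumerate every convention separately:

    template< typename F > struct arity;  // primary template

    // Each calling convention requires its own partial specialization:
    template< typename R > struct arity< R (__cdecl*)() >    { static const int value = 0; };
    template< typename R > struct arity< R (__stdcall*)() >  { static const int value = 0; };
    template< typename R > struct arity< R (__fastcall*)() > { static const int value = 0; };
    // ...and again for every parameter count - the explosion visible in
    // the boost::mem_fn implementation.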

> * explicit calling conventions in (cross platform) public APIs of
> libraries and even OSs are a pretty common thing in my experience

This is common on Windows, but I'd say it's the exception rather than the
rule: I don't remember any POSIX-like system exposing its system API with a
non-standard calling convention. As for libraries, I can't remember when I
last saw an explicit calling convention in an API.

> * "forcing users out of the standard C++ land" - that's just moot i.e.
> isn't that part of what Boost is about?

It's the opposite of what Boost is about (to my understanding). Boost makes
non-standard and complicated things easy and in the spirit of standard C++.
Imposing calling conventions on users is anything but that.

> i.e. there is nothing stopping
> 'us' from standardizing the concept of calling conventions (e.g. to
> specify/handle the different architecture/ABI intricacies of 'evolving
> hardware' - soft/hard float, different GPR file sizes, 'levels' of SIMD
> units etc.)

ABI specs exist for that. Calling conventions are simply semi-legal deviations
from the spec. While they may provide local benefits, I'm opposed to their
creep into API level.

> > There are different kinds of bloat. Force-inlining critical
> > functions of your program will hardly make a significant difference
> > on the total binary size, unless used unwisely or you're in hardcore
> > embedded world where every byte counts.
>
> This assumption can only be true if the 'critical functions of a
> program' (force-inlined into every callsite!) comprise an insignificant
> portion of the entire program...

That is normally the case - the critical part of the program is typically much
smaller than the whole program.

> which is in direct contradiction with
> presumptions you make elsewhere - such as that properly marking cold
> portions of code is just not worth it...

I don't see the contradiction.

> Suddenly you are OK with "restricting users" (those in the 'hardcore
> embedded world') as well as having/using keywords/macros (forceinline)
> that can be used 'wisely' and 'unwisely'?...

Forcing inline in general purpose libraries like Boost can be beneficial or
detrimental, that is obvious. Finding a good balance is what makes the use of
this feature wise. The balance will not be perfect for all environments - just
as your choice of calling conventions or optimization flags.
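For reference, Boost.Config already exposes this particular knob; the
definition is essentially:

    #if defined(_MSC_VER)
    #   define BOOST_FORCEINLINE __forceinline
    #elif defined(__GNUC__)
    #   define BOOST_FORCEINLINE inline __attribute__((__always_inline__))
    #else
    #   define BOOST_FORCEINLINE inline
    #endif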

> + sometimes, even with everything inlined, compilers still cannot
> handle even simple C++ abstractions 'all by themselves'
> https://gist.github.com/rygorous/c6831e60f5366569d2e9

Not sure what that's supposed to illustrate.

> >> For dynamically dispatched calls (virtual functions) choosing the
> >> appropriate c.convention and decorating the function with as many
> >> relevant attributes is even more important (as the dynamic
> >> dispatch is a firewall for the optimiser and it has to assume that
> >> the function 'accesses&throws the whole universe')...
> >
> > My point was that one should avoid dynamic dispatch in hot code in
> > the first place.
>
> AFAICT I first mentioned dynamically dispatched calls.

Umm, so? Not sure I understand.

> + what I already said: it is not just about the direct speed impact but
> about the detrimental impact on the (optimisation) of code surrounding
> the callsites (creating bigger and slower code)...some attributes (like
> noalias, pure and const) can even allow a compiler to hoist a virtual
> call outside a loop...

If a function call can be moved outside of the loop, then why is it inside the
loop? Especially, if you know the piece of code is performance-critical and
the function is virtual?
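To make the disputed scenario concrete (a sketch, GCC syntax; shape and
total_area are illustrative): without an annotation the compiler must assume
each call may modify global state and has to repeat it every iteration; GCC's
'pure' attribute (the attribute, not a pure virtual function) lifts that
assumption:

    struct shape
    {
        virtual double area() const __attribute__((pure));
    };

    double total_area(shape const& s, unsigned n)
    {
        double sum = 0.0;
        for (unsigned i = 0; i != n; ++i)
            sum += s.area();  // with 'pure', hoistable to a single call
        return sum;
    }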

> > Even when the target is known statically (i.e. non-virtual function
> > call) the effect of the call can be significant if it's on the hot
> > path - regardless of the calling convention.
>
> A static call to a (cached/prefetched) function that does not touch the
> stack has pretty much the overhead of two simple instructions CALL and
> RET (and CPUs have had dedicated circuitry, RSBs, for exactly that for
> ages).

Also the prologue and epilogue, unless the function is really trivial, at
which point it can probably be inlined. The function call, as I'm sure you
know, involves writing the return address to the stack anyway. And if the
function has external linkage, the call will likely be made through a symbol
table, which increases the pressure on the TLB and may affect the performance
of your performance-critical function if it is memory-intensive and makes a
few calls itself.

> Please give me an example of a function not automatically inlined (even
> at Os levels) where this is a 'significant effect' (moreover even if you
> could, that still wouldn't prove your point - all that is needed to
> disprove it is the existence of a function whose call overhead is made
> insignificant by using a better c.convention and appropriate attributes
> - trivial)...

My experience in this area mostly comes from image processing algorithms, like
scaling or color transform, for example. Each pixel (or a vector thereof) may
have to be processed in a rather complex way, such that the functions that
implement this often do not inline even at -O3. I experimented with various
calling conventions, including __attribute__((regparm)), but eventually
forcing inline gave the best results.
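For the record, that attribute takes an explicit register count on 32-bit
x86. A sketch, with a hypothetical function name:

    // Pass the first three integer arguments in registers rather than on the stack:
    int blend_pixel(int r, int g, int b) __attribute__((regparm(3)));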

> > and the typical page size is 4k - you'd have to save at least 4k of
> > code to measure the difference, let alone feel it.
>
> Let's try and see how hard it is to save something with Boost.Log:
> - VS2015 U1
> - combined sizes of both library DLLs
> - bjam variant=release link=shared runtime-link=shared address-model=32
> optimization=%1 cxxflags="/Ox%2 /GL /GS- /Gy"
> where (%1 = space, %2 = s) and (%1 = speed, %2 = t) for the for-size and
> for-speed optimised builds, respectively:
> * for-speed build: 959kB
> * for-size build: 730kB
> -> that's a delta of 229kB or 31% (the relative difference is much
> larger if we compare only the text/code sections of the DLLs because of
> all the strings, RTTI records etc...)
> And according to your own assumption that hot code is insignificant in
> size, it follows that you can shave off 30% from Boost.Log exactly by
> having non-hot code compiled for size...

The thing is, I don't know what part of Boost.Log will be hot or cold in the
actual application. (I could guess that some particular parts are most likely
going to be cold, but I can't make any guesses about the rest of the code,
because that depends on which parts of the library the application uses.) Now
let's assume I marked some parts hot and others cold - what happens when my
guess is incorrect? Right, some cold-marked parts are loaded along with the
really-cold parts, and some hot-marked parts are not loaded. Back to square
one.

You could argue that you know for certain what parts will be hot in your
library. Fair enough, such markup could be useful for you.

> > The disk space consumption by data exceeds code by magnitudes, which
> > in turn shows on IO, memory and other related stuff.
>
> If you say so, like for example:
> * CMake 3.4.1 Win32 build:
> - ~40MB total size
> - of that ~24MB are binaries and the rest is mostly _documentation_
> (i.e. not program data)
> - cmake-gui.exe, a single dialog application
> * Git4Windows 2.6.3 Win64 build:
> - ~415MB (!?) total size
> - of that ~345MB are binaries

Not sure what you've downloaded, but the one I've found weighs about 29MB.

https://git-scm.com/download/win

Also, these numbers should be taken with a big grain of salt, as we don't know
how much actual code there is in the binaries. Often there is debug info or
other data embedded into the binaries. Another source of code bloat is
statically linked libraries. The point is that if you're fighting code bloat,
there are other areas you should look into first before you think about
fine-tuning compiler options on a per-function basis.

> So, when you wait for Windows, Visual Studio, Android Studio, Eclipse,
> Adobe Acrobat, Photoshop.....to "map into memory" on i7 machines with
> RAID0 and/or SSD drives that's because of "data"?

There are many factors to performance. And mapping executables into memory,
I'm sure, is by far not the most significant one of them.

> >> as the compiler cannot deduce these things (except for simple
> >> scenarios like assuming all noreturn functions are cold)...and
> >> saying that we can/should then help it with BOOST_LIKELY while
> >> arguing that we shouldn't help it with
> >> BOOST_COLD/MINSIZE/OPTIMIZE_FOR_* is 'beyond self
> >> contradicting'...
> >
> > The difference is the amount of effort you have to put into it and
> > the resulting portability and effect.
>
> Huh? The effort is less, portability the same (macros?) and the effect
> is better (simply because those are better tools for the job)??

Earlier I said that I don't think that the general OPTIMIZE_FOR_SPEED/
OPTIMIZE_FOR_SIZE/etc. macros will work for everyone and everywhere. And
having to specify compiler options manually hampers portability and increases
maintenance effort.
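To make the contrast concrete: BOOST_LIKELY exists in Boost.Config today,
while the cold/minsize markup is what the proposal adds (the latter macros are
sketches with assumed names):

    // Exists today (simplified):
    #if defined(__GNUC__)
    #   define BOOST_LIKELY(x) __builtin_expect(x, 1)
    #else
    #   define BOOST_LIKELY(x) x
    #endif

    // Roughly what is being proposed:
    #if defined(__GNUC__)
    #   define BOOST_COLD    __attribute__((cold))
    #   define BOOST_MINSIZE __attribute__((optimize("Os")))
    #else
    #   define BOOST_COLD
    #   define BOOST_MINSIZE
    #endif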

> > I am talking about debug builds in particular. If I build a debug
> > binary, I want to be able to step through every piece of code,
> > including the ones you marked for speed. If I build for binary size,
> > I want to minimize the size of all code, including the code you marked. I
> > don't care for speed in either of these cases.
>
> Debug builds are a red herring - per function attributes like hot, cold,
> optsize... do not affect debug builds or debug information.

That's unexpected. Not the debug information, but the ability to step through
the code in the debugger, with data introspection, is directly affected by
optimization levels. I don't believe that somehow magically specifying -O3 in
code would produce a more debuggable binary than specifying -O3 on the
compiler command line.
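For reference, the per-function markup in question looks like this in GCC
(illustrative declarations; whether it should override the -O0 of a debug
build is exactly the point of disagreement here):

    void load_configuration() __attribute__((cold, optimize("Os")));  // size-optimized
    void render_frame() __attribute__((hot, optimize("O3")));         // speed-optimized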

