Subject: Re: [boost] [Smart Ptr] make_shared slower than shared_ptr(new) on VC++9 (and 10) with fix
From: Yuriy Zubritsky (mt.wizard_at_[hidden])
Date: 2012-04-25 17:02:55
I discovered the same problem a year ago and wrote my own fix for it. It
outperforms both the old boost::make_shared and std::make_shared (the
latter only by a small margin).
I used VC's implementation of make_shared as the base.
If anyone is interested I can send a patch, but I don't know the correct
process for this.
Thanks
On 25 April 2012 at 12:48, Ivan Erceg <ierceg_at_[hidden]> wrote:
> Hi all,
>
> Before switching to boost::make_shared<> I wanted to test its
> purported performance advantage. I created a simple benchmark measuring
> raw allocation throughput for 3 classes of different sizes with a common
> base class (constructors and destructors trivial). The number of
> allocations was set to 40,000,000, which gave me roughly 10 seconds of
> running time per test.
>
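A minimal sketch of a benchmark of this shape, for readers who want to reproduce the measurement. It is not the original test code: it uses std::shared_ptr and std::make_shared so that it builds without Boost, and the class names, payload sizes, and the run_benchmark helper are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>

// Hypothetical stand-ins for the three test classes with a common base.
struct Base   { int tag = 0; };
struct Small  : Base { char payload[16]  = {}; };
struct Medium : Base { char payload[128] = {}; };
struct Large  : Base { char payload[512] = {}; };

struct BenchResult {
    std::size_t allocations;  // shared_ptr objects created and destroyed
    double seconds;           // wall-clock time for the whole loop
    double allocs_per_second() const {
        return seconds > 0.0 ? allocations / seconds : 0.0;
    }
};

// Calls make(i) `count` times; each shared_ptr is released at the end of
// its loop iteration, so allocation and deallocation are both timed.
template <class Make>
BenchResult run_benchmark(std::size_t count, Make make) {
    std::size_t created = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        std::shared_ptr<Base> p = make(i);
        if (p) ++created;  // observe the pointer so the result is used
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return BenchResult{created, elapsed.count()};
}

// Rotates through the three sizes using a single combined allocation.
inline std::shared_ptr<Base> via_make_shared(std::size_t i) {
    switch (i % 3) {
        case 0:  return std::make_shared<Small>();
        case 1:  return std::make_shared<Medium>();
        default: return std::make_shared<Large>();
    }
}

// Same rotation, but with separate object and control-block allocations.
inline std::shared_ptr<Base> via_new(std::size_t i) {
    switch (i % 3) {
        case 0:  return std::shared_ptr<Base>(new Small());
        case 1:  return std::shared_ptr<Base>(new Medium());
        default: return std::shared_ptr<Base>(new Large());
    }
}
```

Calling run_benchmark(40000000, via_make_shared) and run_benchmark(40000000, via_new) and printing allocs_per_second() for each gives figures in the same form as the output in this message. An optimizer may remove some allocations in a loop this simple, so the absolute numbers should be treated with care.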
> Unfortunately, it turns out that on VC++9 (release target with default
> optimizations) boost::make_shared is significantly slower than simply using
> boost::shared_ptr(new). Here's the benchmark output:
>
> TestBoostMakeShared 10.577s 3.78179e+006 allocs/s
> TestBoostSharedPtrNew 8.907s 4.49085e+006 allocs/s
>
> As you can see, boost::make_shared is over 15% slower (in allocations per
> second) than the boost::shared_ptr(new) idiom.
>
> Since I also have the VC++10 compiler available, I then compared these
> results with the std::shared_ptr and std::make_shared implementations that
> ship with that compiler (but not with VC++9). Here are the results:
>
> TestBoostMakeShared 9.688s 4.12882e+006 allocs/s
> TestBoostSharedPtrNew 8.252s 4.84731e+006 allocs/s
> TestStdMakeShared 5.07s 7.88955e+006 allocs/s
> TestStdSharedPtrNew 8.159s 4.90256e+006 allocs/s
>
> While std::shared_ptr(new) performs about the same as
> boost::shared_ptr(new), std::make_shared really blows away
> boost::make_shared and both shared_ptr(new) tests, being almost twice as
> fast as boost::make_shared.
>
> I then profiled the boost::make_shared test to find its biggest
> bottleneck relative to the boost::shared_ptr(new) profiler run. The
> culprit was immediately obvious: the boost::make_shared test was
> spending above 25% of its time in the "type_info::operator==(class
> type_info const &) const" function. This function was being called
> indirectly from boost::make_shared through boost::get_deleter. After
> digging some more through the implementation I came to the conclusion
> that, in this particular case, we are guaranteed to always be requesting
> the deleter for the right class (namely T from boost::make_shared<T>).
> Since boost::shared_ptr doesn't have a way to retrieve the deleter without
> using RTTI, I decided to add one and use it from an alternative
> boost::make_shared. So I did the following:
>
> 1. I added a virtual function to detail::sp_counted_base
> (detail\sp_counted_base_w32.hpp):
>
>     virtual void * get_raw_deleter( ) = 0;
>
> 2. I implemented the get_raw_deleter() function in sp_counted_impl_p
> (detail\sp_counted_impl.hpp):
>
>     virtual void * get_raw_deleter( )
>     {
>         return 0;
>     }
>
> 3. I implemented the get_raw_deleter() function in sp_counted_impl_pd
> (detail\sp_counted_impl.hpp):
>
>     virtual void * get_raw_deleter( )
>     {
>         return &reinterpret_cast<char&>( del );
>     }
>
> 4. I implemented the get_raw_deleter() function in sp_counted_impl_pda
> (detail\sp_counted_impl.hpp):
>
>     virtual void * get_raw_deleter( )
>     {
>         return &reinterpret_cast<char&>( d_ );
>     }
>
> 5. I added the following function to detail::shared_count:
>
>     void * get_raw_deleter( ) const
>     {
>         return pi_? pi_->get_raw_deleter( ): 0;
>     }
>
> 6. I added the following function to shared_ptr<>:
>
>     void * _internal_get_raw_deleter( ) const
>     {
>         return pn.get_raw_deleter( );
>     }
>
> 7. I made a separate copy of the boost::make_shared function and changed
> a single line from:
>
>     boost::detail::sp_ms_deleter< T > * pd = boost::get_deleter<
>         boost::detail::sp_ms_deleter< T > >( pt );
>
> to:
>
>     boost::detail::sp_ms_deleter< T > * pd =
>         static_cast<boost::detail::sp_ms_deleter< T > *>(
>             pt._internal_get_raw_deleter());
>
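The steps above can be sketched in isolation as follows. This is a simplified model, not Boost's source: counted_base, counted_impl_pd, checked_deleter, raw_deleter, ms_deleter, and other_deleter are hypothetical names that stand in for the control-block classes and the two lookup paths. It only shows where the type_info comparison sits and how the raw accessor avoids it.

```cpp
#include <typeinfo>

// Simplified, hypothetical control-block interface (not Boost's classes).
struct counted_base {
    virtual ~counted_base() {}
    // RTTI route: returns the deleter only if its type matches `ti`.
    virtual void* get_deleter(const std::type_info& ti) = 0;
    // Direct route: returns the deleter's address with no type check.
    virtual void* get_raw_deleter() = 0;
};

// Control block that stores a deleter of type D by value.
template <class D>
struct counted_impl_pd : counted_base {
    D del;
    explicit counted_impl_pd(const D& d) : del(d) {}

    void* get_deleter(const std::type_info& ti) override {
        // This comparison is the type_info::operator== call from the profile.
        return ti == typeid(D) ? static_cast<void*>(&del) : nullptr;
    }
    void* get_raw_deleter() override {
        return static_cast<void*>(&del);
    }
};

// Checked lookup: one type_info comparison per call, null on mismatch.
template <class D>
D* checked_deleter(counted_base& cb) {
    return static_cast<D*>(cb.get_deleter(typeid(D)));
}

// Unchecked lookup: only valid when the caller already knows that the
// stored deleter has type D, as make_shared<T> does for its own deleter.
template <class D>
D* raw_deleter(counted_base& cb) {
    return static_cast<D*>(cb.get_raw_deleter());
}

// Hypothetical deleter types used to exercise both paths.
struct ms_deleter { bool initialized = false; };
struct other_deleter {};
```

The trade-off is that raw_deleter gives up the safety net: asking for the wrong D yields a mis-typed pointer instead of null. That is acceptable inside make_shared, which creates the control block itself, but it would be unsafe as a general replacement for the checked lookup.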
> Benchmarking afterwards gave me the following results on VC++9:
>
> TestBoostSharedPtrNew 9.204s 4.34594e+006 allocs/s
> TestBoostMakeShared 10.499s 3.80989e+006 allocs/s
> TestBoostMakeSharedAlt 7.831s 5.1079e+006 allocs/s
>
> My changes translated into an almost 35% improvement in allocation speed
> over the current implementation of boost::make_shared. To put it
> differently, they amount to a decrease of just over 25% in running time,
> as the profiling results suggested.
>
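The two percentages follow directly from the VC++9 figures above. A small sketch of the arithmetic, with helper names that are illustrative only:

```cpp
// Percent gain in throughput when going from `before` to `after` allocs/s.
inline double throughput_gain_percent(double before, double after) {
    return (after / before - 1.0) * 100.0;
}

// Percent reduction in running time when going from `before` to `after` seconds.
inline double time_reduction_percent(double before, double after) {
    return (1.0 - after / before) * 100.0;
}
```

With the reported numbers, throughput_gain_percent(3.80989e6, 5.1079e6) comes to about 34.1 and time_reduction_percent(10.499, 7.831) to about 25.4. The two figures describe the same change: a run that takes roughly 0.746 of the original time does roughly 1/0.746 = 1.34 times the work per second.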
> Results on VC++10 are similar:
>
> TestBoostSharedPtrNew 8.487s 4.71309e+006 allocs/s
> TestBoostMakeShared 9.609s 4.16276e+006 allocs/s
> TestStdSharedPtrNew 8.283s 4.82917e+006 allocs/s
> TestStdMakeShared 5.039s 7.93808e+006 allocs/s
> TestBoostMakeSharedAlt 6.802s 5.88062e+006 allocs/s
>
> VC++10's std::make_shared is still much faster (almost 35% faster than
> the alternative boost::make_shared) and we will be switching to it once we
> switch to VC++10. In the meantime, it seems to me that boost::make_shared
> should be fixed to improve its performance. Again, this is only one
> compiler, and other compilers might not have such a severe RTTI
> performance issue, but I still think it would be well worth avoiding
> unnecessary RTTI calls during performance-relevant operations such as heap
> allocations.
>
> The testing and changes were done on Boost 1.48.0, but I compared the
> Smart Ptr library sources with Boost 1.49.0 and the above changes should
> work there equally well.
>
> Thanks,
> Ivan