Subject: [boost] [Smart Ptr] make_shared slower than shared_ptr(new) on VC++9 (and 10) with fix
From: Ivan Erceg (ierceg_at_[hidden])
Date: 2012-04-25 15:48:20


Hi all,

Before switching to boost::make_shared<> I wanted to test its purported
performance advantage. I created a simple benchmark that measures raw
allocation throughput for 3 classes of different sizes with a common
base class (constructors and destructors trivial). The number of
allocations was set to 40,000,000, which gave me roughly 10 seconds of
running time per test.
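
A rough sketch of such a harness follows. It is an illustration, not the
benchmark code itself: the names Base, Small, Result and measure are
hypothetical, it uses std::shared_ptr and std::make_shared so that it
builds without Boost (swap in the boost:: equivalents to time the Boost
variants), and it assumes a compiler with std::chrono, which VC++9 and
VC++10 do not ship.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>

// Hypothetical stand-ins for the benchmark classes: a common base and one
// derived class with trivial construction and destruction.
struct Base { int id; };
struct Small : Base { char payload[16]; };

struct Result {
    std::size_t count;          // allocations actually performed
    double seconds;             // wall-clock time for the whole loop
    double allocs_per_second;   // throughput derived from the two above
};

// Calls make() `count` times; each shared_ptr is released at the end of its
// iteration, so every pass pays for one allocation and one deallocation.
template <class Factory>
Result measure(std::size_t count, Factory make)
{
    assert(count > 0);
    typedef std::chrono::steady_clock clock;
    std::size_t live = 0;
    clock::time_point start = clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        std::shared_ptr<Base> p = make();
        if (p) ++live;   // use the result so the loop has an observable effect
    }
    double seconds = std::chrono::duration<double>(clock::now() - start).count();
    Result r = { live, seconds, seconds > 0.0 ? live / seconds : 0.0 };
    return r;
}

// The two idioms being compared.
inline std::shared_ptr<Base> with_new()         { return std::shared_ptr<Small>(new Small()); }
inline std::shared_ptr<Base> with_make_shared() { return std::make_shared<Small>(); }
```

Calling measure(40000000, with_new) and measure(40000000, with_make_shared)
from main and printing seconds and allocs_per_second gives output in the
same shape as the tables in this post; the absolute numbers depend on the
compiler, allocator and machine.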

Unfortunately it turns out that on VC++9 (release target with default
optimizations) boost::make_shared is significantly slower than simply doing
boost::shared_ptr(new). Here's the benchmark output:

TestBoostMakeShared 10.577s 3.78179e+006 allocs/s
TestBoostSharedPtrNew 8.907s 4.49085e+006 allocs/s

As you can see, boost::make_shared is over 15% slower than the
boost::shared_ptr(new) idiom.

Since I also have the VC++10 compiler available, I then compared these
results with the std::shared_ptr and std::make_shared implementations that
come with that compiler (but not with VC++9). Here are the results:

TestBoostMakeShared 9.688s 4.12882e+006 allocs/s
TestBoostSharedPtrNew 8.252s 4.84731e+006 allocs/s
TestStdMakeShared 5.07s 7.88955e+006 allocs/s
TestStdSharedPtrNew 8.159s 4.90256e+006 allocs/s

While std::shared_ptr(new) performs about the same as
boost::shared_ptr(new), std::make_shared really blows away
boost::make_shared and both shared_ptr(new) tests, being almost twice as
fast as boost::make_shared.

I then profiled the boost::make_shared test and compared it with a profiler
run of the boost::shared_ptr(new) test to find the biggest bottleneck. The
culprit was immediately obvious: the boost::make_shared test was spending
more than 25% of its time in the "type_info::operator==(class type_info
const &) const" function. This function was being called indirectly from
boost::make_shared through boost::get_deleter. After digging some more
through the implementation I came to the conclusion that, in this
particular case, we are guaranteed to always be requesting the deleter of
the right class (namely sp_ms_deleter<T> for boost::make_shared<T>), so the
type check is redundant. Since boost::shared_ptr doesn't have a way to
retrieve the deleter without using RTTI, I decided to add one and use it
from an alternative boost::make_shared. So I did the following:

1. I added a virtual function to detail::sp_counted_base
(detail\sp_counted_base_w32.hpp):

  virtual void * get_raw_deleter( ) = 0;

2. I implemented the get_raw_deleter() function in sp_counted_impl_p
(detail\sp_counted_impl.hpp):

  virtual void * get_raw_deleter( )
  {
    return 0;
  }

3. I implemented the get_raw_deleter() function in sp_counted_impl_pd
(detail\sp_counted_impl.hpp):

  virtual void * get_raw_deleter( )
  {
    return &reinterpret_cast<char&>( del );
  }

4. I implemented the get_raw_deleter() function in sp_counted_impl_pda
(detail\sp_counted_impl.hpp):

  virtual void * get_raw_deleter( )
  {
    return &reinterpret_cast<char&>( d_ );
  }

5. I added the following function to detail::shared_count:

  void * get_raw_deleter( ) const
  {
    return pi_? pi_->get_raw_deleter( ): 0;
  }

6. I added the following function to shared_ptr<>:

  void * _internal_get_raw_deleter( ) const
  {
    return pn.get_raw_deleter( );
  }

7. I made a separate copy of the boost::make_shared function and changed a
single line from:

  boost::detail::sp_ms_deleter< T > * pd =
    boost::get_deleter< boost::detail::sp_ms_deleter< T > >( pt );

to:

  boost::detail::sp_ms_deleter< T > * pd =
    static_cast< boost::detail::sp_ms_deleter< T > * >(
      pt._internal_get_raw_deleter() );
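
The difference between the two lookups can be seen in isolation with a
stripped-down model of the control block. This is only a sketch under
stated assumptions: counted_base, counted_impl_pd, ms_deleter,
checked_deleter and raw_deleter are hypothetical stand-ins for
detail::sp_counted_base, detail::sp_counted_impl_pd,
detail::sp_ms_deleter<T>, boost::get_deleter and the static_cast above,
and reference counting is left out entirely.

```cpp
#include <cassert>
#include <typeinfo>

// Simplified stand-in for the control block base class (no ref counting).
struct counted_base {
    virtual ~counted_base() {}
    // Existing route: the caller names a deleter type, checked through RTTI.
    virtual void* get_deleter(const std::type_info& ti) = 0;
    // Proposed route: hand back the deleter's address without any check.
    virtual void* get_raw_deleter() = 0;
};

// Simplified stand-in for a control block that stores a pointer and a deleter.
template <class P, class D>
struct counted_impl_pd : counted_base {
    P ptr;
    D del;
    counted_impl_pd(P p, D d) : ptr(p), del(d) {}

    virtual void* get_deleter(const std::type_info& ti) {
        // This comparison is the type_info::operator== call from the profile.
        return ti == typeid(D) ? &reinterpret_cast<char&>(del) : 0;
    }
    virtual void* get_raw_deleter() {
        return &reinterpret_cast<char&>(del);
    }
};

// Hypothetical deleter playing the role of sp_ms_deleter<T>.
struct ms_deleter {
    bool initialized;
    ms_deleter() : initialized(false) {}
    void operator()(int*) const {}
};

// Checked lookup: returns 0 when the stored deleter is not a D.
template <class D>
D* checked_deleter(counted_base& cb) {
    return static_cast<D*>(cb.get_deleter(typeid(D)));
}

// Unchecked lookup: only valid when the caller already knows that the
// stored deleter is a D, as make_shared does for the block it just created.
template <class D>
D* raw_deleter(counted_base& cb) {
    return static_cast<D*>(cb.get_raw_deleter());
}

// Self-check: for the right type both routes return the same object.
inline void check_same_deleter() {
    counted_impl_pd<int*, ms_deleter> cb(0, ms_deleter());
    assert(checked_deleter<ms_deleter>(cb) == raw_deleter<ms_deleter>(cb));
}
```

The unchecked route trades safety for speed: asking raw_deleter for the
wrong type silently yields a bad pointer, whereas checked_deleter returns 0.
That is why the accessor on shared_ptr is named _internal_ and is meant to
be called only where the deleter type is known by construction.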

Benchmarking afterwards gave me the following results on VC++9:

TestBoostSharedPtrNew 9.204s 4.34594e+006 allocs/s
TestBoostMakeShared 10.499s 3.80989e+006 allocs/s
TestBoostMakeSharedAlt 7.831s 5.1079e+006 allocs/s

My changes translated into an almost 35% improvement in allocation speed
over the current implementation of boost::make_shared. Or to put it
differently, they amount to a 25+% decrease in running time, which is what
the profiling results led us to expect.
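
For anyone who wants to re-derive the two percentages from the VC++9 table,
here is a small sketch of the arithmetic; the helper names
throughput_gain_percent and time_reduction_percent are made up for this
example.

```cpp
#include <cassert>

// Percentage by which new_rate (allocs/s) exceeds old_rate.
inline double throughput_gain_percent(double old_rate, double new_rate) {
    assert(old_rate > 0.0);
    return (new_rate / old_rate - 1.0) * 100.0;
}

// Percentage by which new_seconds undercuts old_seconds.
inline double time_reduction_percent(double old_seconds, double new_seconds) {
    assert(old_seconds > 0.0);
    return (1.0 - new_seconds / old_seconds) * 100.0;
}
```

With the TestBoostMakeShared and TestBoostMakeSharedAlt rows,
throughput_gain_percent(3.80989e6, 5.1079e6) comes to about 34.1 and
time_reduction_percent(10.499, 7.831) to about 25.4.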

Results on VC++10 are similar:

TestBoostSharedPtrNew 8.487s 4.71309e+006 allocs/s
TestBoostMakeShared 9.609s 4.16276e+006 allocs/s
TestStdSharedPtrNew 8.283s 4.82917e+006 allocs/s
TestStdMakeShared 5.039s 7.93808e+006 allocs/s
TestBoostMakeSharedAlt 6.802s 5.88062e+006 allocs/s

VC++10's std::make_shared is still much faster (almost 35% faster than the
alternative boost::make_shared) and we will be switching to it once we
switch to VC++10. In the meantime it seems to me that boost::make_shared
should be fixed to improve its performance. Again, this is only one
compiler, and other compilers might not have such a severe RTTI performance
issue, but I still think it would be well worth avoiding unnecessary RTTI
calls during performance-relevant operations such as heap allocations.

The testing and changes were done on Boost 1.48.0, but I compared the Smart
Ptr library sources with Boost 1.49.0 and the above changes should work
there equally well.

Thanks,
Ivan

