Boost logo

Boost :

Subject: Re: [boost] Proposal: Monotonic Containers - Comparison with boost::pool, boost::fast_pool and TBB
From: Christian Schladetsch (christian.schladetsch_at_[hidden])
Date: 2009-06-22 17:20:41


Hi Luke,

 [...]
>
> Luke> Peak memory will be a good metric. Do you have access to VTune? You
> seem to struggle to identify the cause of performance loss and are reduced
> to guesswork.

Peak memory measured by an external tool is no good, as boost::pool and
boost::fast_pool both leak memory. I need to be able to sample the memory
used at certain times in the application in a cross-platform way. I'll get
to this in due time.

I have been focused on getting the benchmark results more than attempting to
do a complete analysis of their implications. The latest results are here
http://tinyurl.com/lj6nab. I still can't explain why monotonic is faster at
sorting a 500,000 element pre-reserved vector, but I have only reported the
result and have not investigated deeply.

I have added mean, standard deviation, min and max factors for each of the
small, medium, and large benchmark sets. I print a cumulative total at the
end of each set, and a summary of all results at the end. These summaries
are:

GCC:
     scheme mean std-dev min max
      fast 36.3 173 0.25 1.63e+03
      pool 27.8 1.02e+04 0.857 897
       std 1.69 0.91 0.333 5
       tbb 1.59 0.849 0.333 5

MSVC:
    scheme mean std-dev min max
      fast 35.4 132 0.603 1.32e+003
      pool 27.1 1.13e+004 0.693 878
       std 2.7 1.7 0.628 7
       tbb 1.44 0.727 0.291 6.4

The mean is the average speedup factor provided by monotonic allocation over
the given scheme. So for MSVC, summarised over all tests, monotonic is 1.4X
faster than TBB with a standard deviation of 0.7. TBB was 3.4X faster at its
best and 6.4X slower at its worst.

Note that monotonic was on average 35X faster than boost::fast_pool, but
notice too that the standard deviation is very high. At its worst, fast_pool
was 1,300X slower than monotonic and at its best was 1.6X faster.
boost::pool faired little better, with an even worse standard deviation of
10,000(!). One could argue that the tests are skewed, so I invite you to
look at them and suggest any changes or additions. See
http://tinyurl.com/l6vmgq for all the tests, and
http://tinyurl.com/l89llqfor the test harness.

It is no surprise that TBB performs best across both platforms with the
smallest standard deviation.

I have VTune but I am yet to apply it. I have been attempting to gather the
raw data from which more analysis can be made.

There also exist memory profiling tools. In linux you can poll /proc to get
> memory usage. There are little linux utilities out there called memtime and
> such that work this way.

These can't be used for reasons mentioned above.

> I'd like to suggest another benchmark. This is an example of something I
> think would be very bad for your allocator. Build a std::map of size n.
> Loop for O(n^2) iterations. In each iteration insert one random element
> and lookup with lower_bound one random element and remove it. Precompute
> the random numbers and don't include the rand() calls in the time
> measurement of the benchmark.

I have done so. The code is here http://tinyurl.com/l6vmgq in a structure
called "test_map_erase", the results are here http://tinyurl.com/mp6sub.

For your test, monotonic was at least as fast if not faster than the other
schemes. Of course, for large N memory will be exhausted. But for N=5000,
the summary is:

    scheme mean std-dev min max
      fast 0.996 0.0307 0.957 1.04
      pool 1.2 0.00107 1.16 1.25
       std 1.44 0.025 1.41 1.48
       tbb 1.06 0.0228 1.03 1.1

> Your allocator will use O(n^2) memory since it never recovers deallocated
> memory, so if you set n close to the size of the L1 or L2 cache you should
> see dramatic performance loss. This is my typical use case for a std::map.

This is a common enough idiom, yes. But then again, another common idiom is
to first populate maps then use them for fast lookup, like a dictionary.

I do not suggest, nor have I ever suggested, that monotonic is a
general-purpose allocator that is a drop-in replacement for std::allocator
in all cases. I have repeatedly stated the intended limititation of
monotonic allocation, including mentioning that it is not designed for
continual erasure from and insertion into containers. But it does what it
was designed for very well, which is to produce fast and small
datastructures with very fast allocation.

Your test in this case was specifically designed to be pathological for
monotonic allocation. Even so, the above figures show that it still stands
up very well against other schemes. I exhausted memory for N=100,000, but it
didn't t help that boost::pool leaks it for this test either.

> Regards,
> Luke
>

Thanks for your suggestions and comments Luke. I will be adding more
benchmarks soon, to test this ability of monotonic as well:

    {
        length = 4;
        // storage starts on the stack (in this case, 10k of it), then
merges into the heap as needed
        monotonic::storage<10*1024> storage;
        for (size_t n = 0; n < length; ++n)
        {
            // create a new int from storage
            int &n0 = storage.create<int>();

            // create a new string (uses correct alignment)
            string const &s1 = storage.create<string>("foo");
            BOOST_ASSERT(s1 == "foo");

            // allocate 37 bytes with alignment 1
            char *array0 = storage.allocate_bytes(37);
            fill_n(array0, 37, 42);

            // allocate 2537 bytes with 64-byte alignment
            char *array1 = storage.allocate_bytes(2537, 64);
            fill_n(array1, 2537, 123);

            // allocate 1283 bytes with optimal machine alignment
            char *array2 = storage.allocate_bytes<1283>();
            fill_n(array2, 1283, 42);

            array<Unaligned, 42> &array3 = storage.create<array<Unaligned,
42> >();

            // destroy objects. this only calls the destructors; it does not
release memory
            storage.destroy(s1);

            cout << "storage.stack, heap used: " << storage.fixed_used() <<
", " << storage.heap_used() << endl;
        }
        // storage is released. if this was only ever on the stack, no work
is done
    }

//>>>
storage.stack, heap used: 4882, 0
storage.stack, heap used: 9042, 0
storage.stack, heap used: 9298, 3827
storage.stack, heap used: 9554, 7667

This is harder to test, as no other allocation model that I am currently
testing can do this. But I'll try to test against auto_buffer<>, boost::pool
and boost::object_pool and see how it goes.

I am interested in your thoughts on the latest results, and/or the
methodology I used.

Regards,
Christian.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk