Subject: Re: [boost] Proposal: Monotonic Containers
From: Christian Schladetsch (christian.schladetsch_at_[hidden])
Date: 2009-06-12 21:54:45


This is the detail of the 'more-real-world' test mentioned below. My results
show that the monotonic approach is 50% faster in GCC and much faster again in
MSVC. Equally importantly, the monotonic implementation does not fragment
the heap:

// the purpose of this test is to simulate a 'realtime' system
void test_map_list_realtime()
{
    monotonic::inline_storage<1000000> storage;
    const size_t outer_loops = 10*1000;
    const size_t inner_loops = 10000;

    boost::timer t0;
    for (size_t n = 0; n < outer_loops; ++n)
    {
        test_map_list_impl_mono(inner_loops, storage);
        // *** reset the memory usage to zero, hence zero heap fragmentation ***
        storage.reset();
    }
    double e0 = t0.elapsed();
    cout << "test_map_list: mono: " << e0 << endl;

    boost::timer t1;
    for (size_t n = 0; n < outer_loops; ++n)
    {
        test_map_list_impl_std(inner_loops);
    }
    double e1 = t1.elapsed();
    cout << "test_map_list: std: " << e1 << endl;
}

// part of the test_map_list_realtime test;
// intended to resemble one step of a simulation.
void test_map_list_impl_mono(size_t count, monotonic::storage_base &storage)
{
    typedef monotonic::list<int> List;
    typedef monotonic::map<int, List> Map;
    Map map(storage);
    size_t mod = count/10;
    for (size_t n = 0; n < count; ++n)
    {
        int random = rand() % mod;
        Map::iterator iter = map.find(random);
        if (iter == map.end())
        {
            map.insert(make_pair(random, List(storage)));
        }
        else
        {
            iter->second.push_back(n);
        }
    }
}

// same as test_map_list_impl_mono, but using std::containers
void test_map_list_impl_std(size_t count)
{
    typedef std::list<int> List;
    typedef std::map<int, List> Map;
    Map map;
    size_t mod = count/10;
    for (size_t n = 0; n < count; ++n)
    {
        int random = rand() % mod;
        Map::iterator iter = map.find(random);
        if (iter == map.end())
        {
            map.insert(make_pair(random, List()));
        }
        else
        {
            iter->second.push_back(n);
        }
    }
}

On Sat, Jun 13, 2009 at 1:39 PM, Christian Schladetsch <
christian.schladetsch_at_[hidden]> wrote:

> Hi Luke,
>
> On Sat, Jun 13, 2009 at 3:51 AM, Simonson, Lucanus J <
> lucanus.j.simonson_at_[hidden]> wrote:
>
>> Christian Schladetsch wrote:
>> > Note that your test case is ~8% faster when implemented using
>> > monotonic::vector over using std::vector. This is because the monotonic
>> > vector is using the stack for storage, where the loop controls and other
>> > fields are also.
>>
>> How do you know? Unless you went in and measured with vtune how many
>> cache misses you get with each you are just making things up and stating
>> them as fact. I highly doubt the difference is due to improved spatial
>> locality from allocation being on the stack.
>
>
> I agree that I was brash in my statement, and that I have been wrong
> before. I hypothesised that it would be the case. I argued my case in
> theory, then someone else wrote a benchmark. I tested that benchmark using
> my system and using the std::system. The measurements I took showed my
> system was faster, so I logically took it as support of my theory. If you
> could find the time to look at the test in question, I would be delighted to
> hear alternative theories.
>
>
>> Frank Mori Hess wrote:
>> > Are you setting _SCL_SECURE to zero on msvc? It's a common cause of
>> > performance drops with std containers on msvc.
>> >
>> How many times has someone come to boost with an optimization that isn't
>> because they don't know about _SCL_SECURE? This is how the stl gets the
>> reputation of being slow which so damages people's ability to write sensible
>> code. Forcing everyone who is ignorant of what the compiler is doing to
>> write containers with bugs in them because they don't like the fact that stl
>> is made slow by default in MSVC is hardly improving security.
>
>
> I tested it on GCC and VS. I found that yes indeed, my initial report of a
> 100x improvement was still demonstrable on VS, but it was much more modest,
> or non-existent, on GCC.
>
> So I dug a little deeper, ran and re-ran the tests. Same result; VS was
> just much slower than GCC for that test. At first I put it down to being a
> difference in the compiler. After all I thought, 100x improvement on VS,
> even if not reproducible on GCC, was still worthwhile to everyone using VS
> for realtime code. Including, of course, myself.
>
> However, it nagged at me, and later more elaborate tests (which are now in
> the sandbox with revised implementation at
> http://svn.boost.org/svn/boost/sandbox/monotonic) showed that MSVC was
> getting pathologically poor results. Something was wrong.
>
> To cut a short story long, I had once made a change to the project settings
> to use /Zi, which includes debug symbols in the /O2 optimised build. That
> stayed, and also just so happened to give results that showed that the
> std::containers were far slower than the monotonic::containers for MSVC.
>
> A newbie mistake to be sure, compounded by 'wishful thinking', and by the
> fact that that specific fault gave just the results I had hoped for.
>
> My current benchmarks for VS2008 /O2 (and *not* /Zi) are:
> Incrementing ord = 3.16 ps per iteration
> Incrementing ord = 3.18 ps per iteration
> STD: Incrementing ord = 3.43 ps per iteration
> STD: Incrementing ord = 4.04 ps per iteration
> monotonic: 0.001
> std: 0.002
> monotonic: 0.011
> std: 0.019 <<--- big drop from ~1.00!!
> test_map_list: mono: 19.628
> test_map_list: std: 46.226
>
> And for GCC4.3.3 -O4:
> Incrementing ord = 3.1 ps per iteration
> Incrementing ord = 3.1 ps per iteration
> STD: Incrementing ord = 2.8 ps per iteration
> STD: Incrementing ord = 2.9 ps per iteration
> monotonic: 0
> std: 0
> monotonic: 0.02
> std: 0.01 <<--- std:: is actually faster on GCC; deviation is high for this test
> test_map_list: mono: 14.5
> test_map_list: std: 21.76
>
> So the results are mixed; except for the test_map_list test, which is 2.3x
> faster using MSVC and 1.5x faster using GCC.
>
> That test is the first more-real-world test provided, and I think is
> starting to get outside of the 'toy benchmarks' area. The implementation is
> at
> https://svn.boost.org/svn/boost/sandbox/monotonic/libs/monotonic/test/main.cpp
>
> This test creates and uses a map<int, list<int> > a few times. It also more
> accurately fits the original motivations outlined in
> https://svn.boost.org/svn/boost/sandbox/monotonic/libs/monotonic/doc/index.html
>
> Is a 50%-200% improvement worth investigating further? Perhaps, perhaps
> not. It is certainly not 'compelling'.
>
> Storing all the nodes of a tree in a buffer instead of scattered in the
>> heap is a good idea. Storing all the nodes of a tree in a buffer of a fixed
>> size that causes a crash when you exceed it is a bad idea.
>
>
> There are many cases in high-performance, realtime code that you can or
> should know and set high-level water marks for memory usage, either
> per-frame or per-configuration or both.
>
> In any case, the same argument can be made against exhausting any memory
> pool, including the heap itself. To claim that the proposed system is faulty
> because it can exhaust an explicitly-sized allocation buffer seems a little
> mischievous.
>
>
>> We get massive performance improvements by using a (non-standard) memory
>> buffer allocator in internal code that implements a variant of R*Tree. We
>> don't crash if we keep adding elements to the tree. We find that in toy
>> benchmarks like Christian shows the performance advantage of storing the
>> nodes of a tree in a contiguous buffer is greatly less than in a real
>> application because even the system allocator will cluster them together in
>> memory when memory is completely free and because everything in a toy
>> benchmark fits in cache anyway.
>
>
> Perhaps. I didn't think "gee, how can I make some wacky boost proposal; I
> know, I'll come up with a strange allocator and try to befuddle things with
> poorly executed benchmarks!" Perhaps that's how it has progressed thus far,
> however.
>
> Even so, the original idea and implementation came from performance
> analysis of real-world game-system code that I write for 360 and PS3. I saw
> that, especially on the PS3, cache was really important and I didn't want to
> be jumping around all over the heap. I wanted to use standard containers
> (not custom ones), which used either the stack or the heap for memory. I
> especially liked the idea of using the stack, as it was often the case that
> these temporary containers only existed for the lifetime of a stack frame.
> I also knew that I didn't want memory to be "allocated and freed"; I knew
> how much memory would be required by a static analysis of the code, so I did
> not need a "memory allocator" in the traditional sense.
>
>
>> What Christian is trying to do is in general good, but this idea that
>> allocating the buffer on the stack is somehow a great idea seems to be
>> getting in the way of the portion of his overall proposal that is actually
>> good. He also hasn't
>> yet addressed the design flaw which makes his data structures unsafe to
>> use.
>
>
> If you are referring to the memory alignment issues, the current
> implementation finally does a very basic alignment, after some
> wrong-headedness from me. As noted by Thorsten, it should be changed again
> to use boost::align_memory.
>
> Another important part of my motivation, one that hasn't been explicitly
> mentioned much because it is a "soft" argument and hard to measure, is
> that of heap fragmentation.
>
> This is a huge issue for any real-time system that has to stay up and
> running indefinitely, and was also a major motivation for the monotonic
> proposal. Being able to 'wipe the slate clean' after each frame or
> configuration state is vitally important for continuous realtime systems.
>
> This is what the 'real-world-like' test (test_map_list in the output pasted
> above) that currently provides 50%-250% gains in performance demonstrates.
> On top of the performance gains, the test also demonstrates that the
> monotonic system does not fragment the heap over time.
>
> I haven't been pushing this aspect of my proposal a lot because I know it's
> soft and hard to measure, and because it could be argued to be too 'niche'
> of a concern to be much of an issue. So, I've pushed for results, perhaps
> too aggressively, in other areas like performance and ease of use.
>
>
>> Performance at the expense of safety is easy to achieve, just program in
>> C.
>>
>> Instead of reinventing the wheel perhaps can we just find some good
>> allocators people have written and compare them? Since there is no
>> std::R*Tree and no way Christian would implement monotonic::R*Tree
>
>
> I wouldn't need to implement monotonic::R*Tree, which would just be said
> hypothetical implementation using a monotonic::allocator.
>
>
>> it would be better to provide a standard compliant allocator people can
>> use on their own containers that are modeled after the standard than
>> re-implement the stl containers with a non-standard compliant allocator.
>
>
> I haven't re-implemented std::containers. I have, perhaps in poor
> judgement, made a proposal that includes a wrapping of them that facilitates
> using them with a monotonic allocator. The proposal can be used either way:
>
> typedef boost::monotonic::map<int, boost::monotonic::list<int> > Map;
>
> or:
>
> typedef std::map<int, std::list<int, boost::monotonic::allocator<int> >,
> std::less<int>, boost::monotonic::allocator<int> > Map;
>
> So far, I have avoided a discussion about the merits and lack thereof of
> the design and practical use of std::allocators.
>
> Monotonic::allocator stores a pointer to storage. If this is an affront to
> the standard, why can and indeed must allocators be compared for equality?
>
> Put another way, if allocators really cannot have any state at all, not
> even a pointer, why must they be compared?
>
> If they really cannot have state, then a work-around is to have a static
> global that points to the storage to use. Hack? Yes, but so are
> std::allocators.
>
>
>> Luke
>
>
> Thanks for your comments Luke, I really appreciate your thoughts on the
> matter.
>
> At the end of the day, we all want the same thing; standards-compliant,
> cross-platform, high-performance programs. If monotonic never sees the light
> of day, but at least starts a discussion about the issues it tries to
> address, and something else comes of it completely, then I would still
> consider that a complete win.
>
> I also need to look at using auto_buffer<T> for storage,
> boost::align_memory, and research R*Trees, and a lot of other things
> besides, so for me personally it has been very beneficial.
>
> So I say thanks guys for all the comments, and I apologise for sometimes
> being obtuse.
>
> Regards,
> Christian.
>
>


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk