
From: Michael Fawcett (michael.fawcett_at_[hidden])
Date: 2007-03-19 09:31:17


On 3/17/07, Michael Fawcett <michael.fawcett_at_[hidden]> wrote:
> On 3/17/07, Gennadiy Rozental <gennadiy.rozental_at_[hidden]> wrote:
> >
> > "Michael Fawcett" <michael.fawcett_at_[hidden]> wrote in message
> > news:bc5bffe80703161407ybbc7c71ja06b0c2230c5602c_at_mail.gmail.com...
> > > On 3/16/07, Gennadiy Rozental <gennadiy.rozental_at_[hidden]> wrote:
> > >> So, could you give an example of an algorithm, and/or another
> > >> specific performance advantage example?
> >
> > Would you care to share the source code for your test with us?
> >
> > > I just constructed a quick test and used VTune and std::clock to
> > > sample. Here are my results:
> > >
> > > sorting across 1,000,000 random integers (averaged across 5 runs)
> > > std::list::sort - 6.87s
> > > boost::ilist::sort - 4.45s
> > >
> > > number of L2 cache requests (L1 cache misses)
> > > std::list - 1,790
> > > boost::ilist - 862
> >
> > How about std::vector?
>
> Sure, I'm at home now, but I can post it from work on Monday. I
> didn't test vector, but I assume it would beat intrusive list in speed
> and have fewer cache misses as well.

I've attached the source file and pasted the code below my signature.
It does make use of the VTune API, but if you undef USE_VTUNE at the
top of the file, it will compile without VTune.

A note about the block_prefetch code: I simply grabbed it from a
paper written by AMD. My hope was that it would clear the cache, so
that the previously generated nodes were no longer resident when the
sort ran, but I did not analyze the generated code to verify that
this actually happened. AMD's text notes that optimizers commonly
remove the code in block_prefetch entirely, so it may be useless.
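
On the cache-clearing point: if the worry is that the optimizer drops
the reads, one alternative is to write rather than read, through a
volatile pointer, into a buffer larger than the cache. This is my own
sketch, not from AMD's paper, and the 8 MiB figure is an assumption
about L2 size that would need adjusting for the target CPU.

```cpp
#include <cstddef>
#include <vector>

// Sketch: evict previously cached data by touching one byte in every
// 64-byte cache line of a buffer assumed to be larger than L2.
// Stores through a volatile pointer cannot legally be optimized
// away, which sidesteps the problem AMD warns about with
// block_prefetch. Returns the number of cache lines touched.
std::size_t flush_cache()
{
        const std::size_t size = 8 * 1024 * 1024; // assumed >= L2 size
        std::vector<char> buffer(size);
        volatile char *p = &buffer[0];
        std::size_t lines = 0;
        for (std::size_t i = 0; i < size; i += 64, ++lines)
                p[i] = static_cast<char>(i);
        return lines;
}
```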

--Michael Fawcett

#define USE_VTUNE
#define USE_BOOST
//#define USE_STDLIB

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#ifdef USE_STDLIB
#include <list>
#endif
#include <ostream>
#include <vector>

#ifdef USE_VTUNE
#include "vtuneapi.h"
#endif

#ifdef USE_BOOST
#include "boost/intrusive/ilist.hpp"
#endif
#include "boost/shared_array.hpp"

#ifndef USE_VTUNE
#define VTResume()
#define VTPause()
#endif

#ifdef USE_BOOST
struct my_tag;
struct MyNode : public boost::intrusive::ilist_base_hook<my_tag>
{
        MyNode(int n) : num(n) { }
        int num;
        bool operator<(const MyNode &rhs) const
        {
                return num < rhs.num;
        }
};
typedef boost::intrusive::ilist<boost::intrusive::ilist_base_hook<my_tag>::value_traits<MyNode> > list;
#elif defined(USE_STDLIB)
typedef int MyNode;
typedef std::list<int> list;
#endif

#define CACHEBLOCK 128 // prefetch block size
#define PREFETCH_BLOCK_SIZE (CACHEBLOCK * 64)

#pragma optimize("", off)
int fetch;
void block_prefetch(const void *addr)
{
        // Read every 64th byte (every 16th int) to touch each cache line once.
        const int *a = static_cast<const int *>(addr);
        fetch += a[0] + a[16] + a[32] + a[48]
                        + a[64] + a[80] + a[96] + a[112]
                        + a[128] + a[144] + a[160] + a[176]
                        + a[192] + a[208] + a[224] + a[240];
        a += 256;
        fetch += a[0] + a[16] + a[32] + a[48]
                        + a[64] + a[80] + a[96] + a[112]
                        + a[128] + a[144] + a[160] + a[176]
                        + a[192] + a[208] + a[224] + a[240];
        a += 256;
        fetch += a[0] + a[16] + a[32] + a[48]
                        + a[64] + a[80] + a[96] + a[112]
                        + a[128] + a[144] + a[160] + a[176]
                        + a[192] + a[208] + a[224] + a[240];
        a += 256;
        fetch += a[0] + a[16] + a[32] + a[48]
                        + a[64] + a[80] + a[96] + a[112]
                        + a[128] + a[144] + a[160] + a[176]
                        + a[192] + a[208] + a[224] + a[240];
}
#pragma optimize("", on)

int main(void)
{
        const int NumTestNodes = 1000000;
        std::srand(static_cast<unsigned int>(std::time(0)));

        std::vector<MyNode> my_nodes;
        std::generate_n(std::back_inserter(my_nodes), NumTestNodes, std::rand);
        std::vector<boost::shared_array<int> > fragmented_memory;
        list my_list;

        for (int i = 0; i < NumTestNodes; ++i)
        {
                // Simulate fragmented memory
                boost::shared_array<int> arr(new int[std::rand() % 500]);
                fragmented_memory.push_back(arr);

                my_list.push_back(my_nodes[i]);
        }

        // Clear the cache
        boost::shared_array<char> dummy(new char[4096]);
        block_prefetch(dummy.get());

        std::clock_t start_time = std::clock();
        VTResume();
        my_list.sort();
        VTPause();
        std::cout << "time for sorting: "
                  << static_cast<double>(std::clock() - start_time) / CLOCKS_PER_SEC
                  << std::endl;
        my_list.clear();
}


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk