Subject: Re: [boost] [Review] Lockfree review starts today, July 18th
From: Tim Blechmann (tim_at_[hidden])
Date: 2011-07-21 07:20:39
> The documentation talks a bit about false sharing and to some extent about
> cacheline alignment to avoid it, but I don't see that to the extent I
> would expect in the code. Specifically, how do you ensure that a given object
> (I only looked at ringbuffer) _starts_ on a cacheline boundary?
I am trying to ensure that the parts which are modified by different threads
are in different cache lines. However, I don't necessarily care whether they
start at the beginning of a cache line.
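To illustrate the idiom, a minimal sketch (untested; the 64-byte line size and
all names are assumptions on my part, not the actual Boost.Lockfree layout):

#include <cstddef>

static const std::size_t cacheline_size = 64; // assumed; CPU-dependent

struct spsc_indices
{
    std::size_t read_pos;        // only written by the consumer
    char pad[cacheline_size];    // a full line of separation
    std::size_t write_pos;       // only written by the producer
    // (the real code uses atomic indices, of course)
};

With a full cache line of padding in between, no single line can contain bytes
of both indices, no matter where the object itself starts.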
> I only see this weird padding "idiom" that everyone seems to use, but
> nothing to prevent a ringbuffer from being put in the middle of other
> objects that reside on cachelines that are happily write-allocated by
> other threads. For instance, what happens for:
> ringbuffer<foo> x;
> ringbuffer<foo> y;
> Consider a standard toolchain without fancy optimizations. Wouldn't this
> normally result in x.read_pos and y.write_pos being allocated on the
> same cacheline?
In this case one could argue that you should ensure the padding manually :)
Nevertheless, there is one point that I should probably address: I should
enforce that the read index, the write index, and the actual ringbuffer array
each live on different cache lines.
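For the x/y example above, a manual fix could look like this (untested sketch
using the GCC/Clang aligned attribute and an assumed 64-byte line size; the
ringbuffer here is just a stand-in for the real class):

#include <cstddef>

struct foo { int payload; };

template <typename T>
struct ringbuffer            // minimal stand-in for the class in question
{
    std::size_t read_pos;
    std::size_t write_pos;
    T data[1024];
};

// aligned(64) rounds the struct size up to a multiple of 64 and places
// every instance on a 64-byte boundary, so x and y can never share a line
struct __attribute__((aligned(64))) padded_ringbuffer
{
    ringbuffer<foo> rb;
};

padded_ringbuffer x;
padded_ringbuffer y;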
> There also doesn't seem to be a way to override the allocation of memory.
> For the kind of low latency we (as in Morgan Stanley) are interested in,
> we may sometimes care about delays from the lazy PTE mechanisms that many
> operating systems have. If you simply allocate via new, you may get a
> few lazily allocated pages from the OS. A 1ms delay for a page fault is
> something we do care about.
> Is there any good way to override the allocation?
If the size of the ringbuffer is specified at run time, there is currently no
way to do this; I should probably add allocator support. However, this will
only help if your allocator forces the memory regions into physical RAM, by
using mlock() or the like and by touching the pages to avoid minor page
faults.
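Such an allocator might look roughly like this (untested sketch of the idea,
not an interface the library provides today; the C++03 allocator boilerplate
like rebind, construct and destroy is omitted):

#include <cstddef>
#include <cstring>
#include <new>
#include <sys/mman.h>   // mlock/munlock (POSIX)

template <typename T>
struct pinned_allocator
{
    typedef T value_type;

    T * allocate(std::size_t n)
    {
        std::size_t bytes = n * sizeof(T);
        void * p = ::operator new(bytes);
        if (mlock(p, bytes) != 0)      // pin into physical RAM; needs
        {                              // a sufficient RLIMIT_MEMLOCK
            ::operator delete(p);
            throw std::bad_alloc();
        }
        std::memset(p, 0, bytes);      // touch every page up front to
        return static_cast<T*>(p);     // avoid minor faults on the hot path
    }

    void deallocate(T * p, std::size_t n)
    {
        munlock(p, n * sizeof(T));
        ::operator delete(p);
    }
};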
> Are there any performance targets/tests? E.g. for a ringbuffer, I found a
> test with a variable number of producers and consumers useful, where
> producers feed well-known data and consumers do almost nothing (e.g. just
> add the dequeued numbers or something) and see what kind of feed rates can
> be sustained without the consumer(s) falling behind.
The ringbuffer is a single-producer, single-consumer data structure; if you
use multiple producers, it will be corrupted!
In general I hesitate to publish any performance numbers, because the
performance heavily depends on the CPU that is used.
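For a single producer and a single consumer, a feed-rate test along the lines
you describe is straightforward to write; an untested sketch (assuming
enqueue/dequeue return false when the buffer is full/empty, and using
boost::thread as the harness):

#include <boost/lockfree/ringbuffer.hpp>
#include <boost/thread.hpp>
#include <iostream>

static const long iterations = 10000000;
boost::lockfree::ringbuffer<long, 16384> rb;  // compile-time capacity

void producer(void)
{
    for (long i = 0; i != iterations;)
        if (rb.enqueue(i))        // false when the buffer is full,
            ++i;                  // so spin until there is room
}

void consumer(void)
{
    long sum = 0;
    long value;
    for (long i = 0; i != iterations;)
        if (rb.dequeue(value))    // false when the buffer is empty
        {
            sum += value;         // the consumer does almost no work
            ++i;
        }
    std::cout << "checksum: " << sum << std::endl;
}

int main(void)
{
    boost::thread consume(consumer);
    boost::thread produce(producer);
    produce.join();
    consume.join();
    return 0;
}

Timing the main thread around the joins gives a sustained feed rate, but the
absolute numbers will differ a lot between microarchitectures.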
> Lastly, what's going on with all the atomic code in there? Can I assume
> that's just an implementation detail that overrides things in the current
> Boost.Atomic lib and hence ignore it for the review?
Boost.Lockfree depends on Boost.Atomic. For this review, Boost.Atomic should
be ignored; we probably have to decide later whether to postpone the inclusion
of Boost.Lockfree until Boost.Atomic has been reviewed, or whether I should
provide a modified version of Boost.Atomic as an implementation detail.
Nevertheless, I have a small wrapper which could be used to switch between
boost::atomic and std::atomic. Unfortunately, none of my compilers implements
the necessary parts of the atomic<> template ...
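The wrapper is essentially a set of using declarations selected by a
configuration macro; schematically (the macro name is made up for this
sketch):

#ifdef USE_STD_ATOMIC           // hypothetical config macro
    #include <atomic>
    namespace lockfree_detail
    {
        using std::atomic;
        using std::memory_order_relaxed;
        using std::memory_order_acquire;
        using std::memory_order_release;
    }
#else
    #include <boost/atomic.hpp>
    namespace lockfree_detail
    {
        using boost::atomic;
        using boost::memory_order_relaxed;
        using boost::memory_order_acquire;
        using boost::memory_order_release;
    }
#endif

// the library code then refers to lockfree_detail::atomic<T> and does not
// care which implementation was selected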