Boost logo

Boost :

Subject: Re: [boost] [interval container (itl)] A pretty large bitset
From: Joachim Faulhaber (afojgo_at_[hidden])
Date: 2009-11-03 08:47:04


Hi Zach,

2009/11/3 Zachary Turner <divisortheory_at_[hidden]>:
> On Mon, Nov 2, 2009 at 11:25 AM, Joachim Faulhaber <afojgo_at_[hidden]>wrote:
>
>> I'd like to ask Zachary Turner to comment on the
>> use case. Can this be useful in your field?
>
> This is interesting and at some point I might download the library and play
> with it.  Is there any easy way for me to determine the memory usage of an
> interval_map / interval_set?

interval_map / interval_set are implemented via std::map / std::set.
So for every node carrying an interval / interval-value-pair
you have three pointers (left,right,parent) and a color bit.
The interval itself needs 2*sizeof(domain_type) plus
one byte for the border information. Finally a map node
has to provide space for the value of it's codomain type.

The number of interval or interval-value nodes is
returned by the function interval_count() or iterative_size()
on all interval containers.

> It's hard to say whether this will offer an advantage in my use case because
> being that I'm interested in primarily a) which blocks on a filesystem are
> allocated, and b) which blocks on a filesystem have changed since the last
> time I checked, it depends heavily on the usage patterns of the user.
>
> The which-blocks-are-allocated bitset would be more likely to *not* benefit
> from this type of additional compression, or at least not significantly,
> because it would be very rare on a disk to have a  significantly high
> concentration of alternating bits,

as I have already mentioned in this thread, the periodic bit compression
is a nice byproduct but should not be overemphasized. Clearly we won't
find periodic patterns very often in file allocation tables ;)

> or extremely short runs.

... hmm, I thought, due to fragmentation there could be more
narrow sequences of changing block patterns that could profit
from bitcompression ...

> The which-blocks-have-changed-since-last-time bitset is a little more
> uncertain.  Obviously file access patterns aren't truly random, but let's
> just say in the worst case scenario of completely random disk access, and
> hence completely random bits being flipped from 0 to 1, how would this
> affect the memory usage of this template?  I feel like this is kind of a
> difficult situation to analyze so I'm not looking for an exact O() storage
> complexity or anything, but just any insight you have to offer.   My current
> solution compresses only intervals and thus also suffers in the case of
> random access, but I wonder if this solution would allow me to save space.

It would allow to save space only, if in addition to interval compression
your blocks-have-changed-patterns allow for additional bit compression:
Clusters of changed-blocks within not too big ranges, that
can be compressed into 'local' bitsets.

In the worst case scenario of randomly and sparsely distributed bits
'large_bitset' has no advantage over an itl::interval_set or your
compressed_bitset implementation (if I understood it right).

> What kind of iteration support is provided?  I often need to iterate through
> each bit that is set to 1, where the value_type of the iterator is the
> 0-based index of the bit in the bitset.  So for example if I have a bitset
> representing the bit string 0110110101100 I would want, or at least I would
> want to be able to easily create, an iterator that returns 1, 2, 4, 5, 7, 9,
> 10.

mini::large_bitset started as a non trivial example on the usefulness
of itl::interval_maps. So this implementation
http://www.herold-faulhaber.de/boost_itl/doc/libs/itl/doc/html/boost_itl/projects.html
aims to be very *mini*mal. Also the more elaborated version 'interval_bitset'
https://svn.boost.org/svn/boost/sandbox/itl/boost/itl_xt/interval_bitset.hpp
is a first shot and not yet very much field tested (therefore not part
of the core library submitted for review).

With itl::interval_bitset you can currently iterate over intervals
only. I am generally reluctant to implement element iterators
for interval containers, because this would be IMO a permanent
invitation for inefficient usage. Apart from that, it wouldn't
be difficult to provide such an iterator for itl::interval_bitset.

> The other main important thing would be the ability to do in-place
> bitwise |, &, ~.

There are, even in mini::large_bitset, three different overloads for
bitwise operators (except for ~) that are performing inplace,
if the inserted range is already in the container.

As a useful addition I think about adding operators o= to
combine whole (unit-) bitsets at a given offset

interval_bitset& interval_bitset::operator o=
    (const std::pair<domain_type, bitset_type>& bits_at);

> Advancing an iterator also needs to have constant-time
> complexity for me, ...

that would be no problem (apart from the reluctance problem ;)

> ... and the bitwise operators should probably be no greater
> than n log(n).

currently
O(log(n)) for bits (elements)
O(log(n)) for unit bitsets
O(n) for intervals of elements (interval defined bitset)
O(m log(n+m)) for interval_bitsets of iterative_size n and m

Thank you for all your comments and suggestions!

Cheers
Joachim


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk