Boost logo

Boost Users :

Subject: Re: [Boost-users] container with large number of elements and a small number of invalide elements
From: MM (finjulhich_at_[hidden])
Date: 2013-11-04 13:18:18


On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn_at_[hidden]> wrote:

> On 11/1/2013 7:24 PM, Travis Gockel wrote:
>
>> On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn_at_[hidden]
>> <mailto:Jeffrey.Flinn_at_[hidden]>> wrote:
>>
>> On 11/1/2013 1:28 PM, MM wrote:
>>
>> On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn_at_[hidden]
>> <mailto:Jeffrey.Flinn_at_[hidden]>
>> <mailto:Jeffrey.Flinn_at_gmail.__com
>>
>> <mailto:Jeffrey.Flinn_at_[hidden]>>> wrote:
>> On 10/31/2013 1:14 PM, MM wrote:
>> On 31 October 2013 16:10, Klaim - Joël Lamotte
>> <mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>
>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>>
>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>
>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>>>> wrote:
>>
>> What would "valid"/"invalid" means here?
>> If it's "exist" or "don't exist", then maybe using
>> boost::optional<MyPod> would be enough.
>>
>> That's right. How about sizeof( boost::optional<MyPod>
>> ) vs sizeof(
>> MyPOD ) ?
>> Also, what sort of iterator can iterate over just the
>> valid
>> values? or
>> how to implement one?
>> The point here is that there's a small number of
>> elements that are
>> invalid, their indices in the container are random,
>>
>> So why not keep the invalid items from getting into the
>> container in
>> the first place?
>>
>> Jeff
>>
>> because they are processed at a later stage, their indices in the
>> container are relevant,
>>
>>
>> I'd need a better description of just what defines and item to be
>> valid/invalid. And just what this processed at a later stage means.
>>
>> If you've some functions:
>>
>> bool valid(const MyPod& ) { return ...; }
>> bool invalid(const MyPod& pod) { return !valid(pod); }
>>
>> , and a pre-existing container then you can get an adapted view
>> using boost::range::adaptors::__filtered like:
>>
>>
>> typedef std::vector<MyPod> MyPods;
>>
>> MyPods all(...);
>> MyPods bad;
>>
>> boost::copy(all | filtered(invalid), std::back_inserter(bad));
>>
>> boost::foreach(all | filtered(valid), someFunction);
>>
>> It all depends on the context of where this is used. Another
>> approach uses the boost icl (interval container library). One
>> example is:
>>
>> http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_
>> _boost_icl/projects.html#boost___icl.projects.large_bitset
>>
>> <http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/
>> html/boost_icl/projects.html#boost_icl.projects.large_bitset>
>>
>> I've used a similar approach to just store the "valid" bases in a
>> dna sequences and keeping track of gaps with an interval container.
>> The advantage is avoiding the filtering cost when the application is
>> primarily interested in the valid items only.
>>
>>
>> I would recommend checking out Google sparsehash -- specifically,
>> sparse_hash_set<T>:
>> http://code.google.com/p/sparsehash/source/browse/
>> trunk/src/sparsehash/sparse_hash_set
>>
>> The biggest difference between that and something like an
>> std::vector<boost::optional<T>> is google::sparse_hash_set will re-order
>> values on you (the nature of hash tables). It also forces you to have a
>> "deleted key" which exists in the realm of possible T values. Based on
>> impression that I get from your original email, this fits with your idea
>> of an "invalid" T.
>>
>> That said, it is a ridiculously low-overhead container (on the order of
>> ~2 bits/entry vs 4 bytes/entry with a boost::optional).
>>
>
> That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits
> per interval. In the OP's case that would be 64x100(max number of invalid
> items assuming all invalid separated by atleast 1 valid item) about 3
> orders of magnitude less.
>
>
> Jeff
>

Thanks for the various answers. I'm reading about ICL, and sparsehash to
learn more.

basically,

struct MyPOD { double d1; ..... double d8; }
and I currently have std::vector<MyPOD> with 1e6 elements.
100 of those would be == a particular instance of MyPOD invalid { +Inf,
+Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant

I need:
1. calculate the max element, and the min element, _while_ ignoring the 100
or so invalid entries (I don't know where they are), I can't however change
the vector<MyPOD>. it's read only.

2. I need also to plot it (ie, plot all the elements that are not Invalid).
The index of each element in vector is transformed and returns a sort of
coordinate to plot a point

 So, I would have wrapped this vector into a class, and exposed similar
interface to vector, but would have implemented an iterator that just skips
the invallid entries:the iterator would skip the invalid entry until next
valid one when incremented or decremented.
My class would also be indexable with [] with this access not with the same
order as vector's [] (it would skip the invalid entries, so slower)

Perhaps the best choice is a boost range adapted view.

Reading the docs....

thanks very much

MM



Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net