Boost logo

Boost Users :

Subject: Re: [Boost-users] container with large number of elements and a small number of invalide elements
From: MM (finjulhich_at_[hidden])
Date: 2013-11-05 12:43:26


On 4 November 2013 18:18, MM <finjulhich_at_[hidden]> wrote:

> On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn_at_[hidden]> wrote:
>
>> On 11/1/2013 7:24 PM, Travis Gockel wrote:
>>
>>> On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn_at_[hidden]
>>> <mailto:Jeffrey.Flinn_at_[hidden]>> wrote:
>>>
>>> On 11/1/2013 1:28 PM, MM wrote:
>>>
>>> On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn_at_[hidden]
>>> <mailto:Jeffrey.Flinn_at_[hidden]>
>>> <mailto:Jeffrey.Flinn_at_gmail.__com
>>>
>>> <mailto:Jeffrey.Flinn_at_[hidden]>>> wrote:
>>> On 10/31/2013 1:14 PM, MM wrote:
>>> On 31 October 2013 16:10, Klaim - Joël Lamotte
>>> <mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>
>>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>>
>>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>
>>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>>>> wrote:
>>>
>>> What would "valid"/"invalid" means here?
>>> If it's "exist" or "don't exist", then maybe using
>>> boost::optional<MyPod> would be enough.
>>>
>>> That's right. How about sizeof( boost::optional<MyPod>
>>> ) vs sizeof(
>>> MyPOD ) ?
>>> Also, what sort of iterator can iterate over just the
>>> valid
>>> values? or
>>> how to implement one?
>>> The point here is that there's a small number of
>>> elements that are
>>> invalid, their indices in the container are random,
>>>
>>> So why not keep the invalid items from getting into the
>>> container in
>>> the first place?
>>>
>>> Jeff
>>>
>>> because they are processed at a later stage, their indices in the
>>> container are relevant,
>>>
>>>
>>> I'd need a better description of just what defines and item to be
>>> valid/invalid. And just what this processed at a later stage means.
>>>
>>> If you've some functions:
>>>
>>> bool valid(const MyPod& ) { return ...; }
>>> bool invalid(const MyPod& pod) { return !valid(pod); }
>>>
>>> , and a pre-existing container then you can get an adapted view
>>> using boost::range::adaptors::__filtered like:
>>>
>>>
>>> typedef std::vector<MyPod> MyPods;
>>>
>>> MyPods all(...);
>>> MyPods bad;
>>>
>>> boost::copy(all | filtered(invalid), std::back_inserter(bad));
>>>
>>> boost::foreach(all | filtered(valid), someFunction);
>>>
>>> It all depends on the context of where this is used. Another
>>> approach uses the boost icl (interval container library). One
>>> example is:
>>>
>>> http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_
>>> _boost_icl/projects.html#boost___icl.projects.large_bitset
>>>
>>> <http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/
>>> html/boost_icl/projects.html#boost_icl.projects.large_bitset>
>>>
>>> I've used a similar approach to just store the "valid" bases in a
>>> dna sequences and keeping track of gaps with an interval container.
>>> The advantage is avoiding the filtering cost when the application is
>>> primarily interested in the valid items only.
>>>
>>>
>>> I would recommend checking out Google sparsehash -- specifically,
>>> sparse_hash_set<T>:
>>> http://code.google.com/p/sparsehash/source/browse/
>>> trunk/src/sparsehash/sparse_hash_set
>>>
>>> The biggest difference between that and something like an
>>> std::vector<boost::optional<T>> is google::sparse_hash_set will re-order
>>> values on you (the nature of hash tables). It also forces you to have a
>>> "deleted key" which exists in the realm of possible T values. Based on
>>> impression that I get from your original email, this fits with your idea
>>> of an "invalid" T.
>>>
>>> That said, it is a ridiculously low-overhead container (on the order of
>>> ~2 bits/entry vs 4 bytes/entry with a boost::optional).
>>>
>>
>> That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits
>> per interval. In the OP's case that would be 64x100(max number of invalid
>> items assuming all invalid separated by atleast 1 valid item) about 3
>> orders of magnitude less.
>>
>>
>> Jeff
>>
>
> Thanks for the various answers. I'm reading about ICL, and sparsehash to
> learn more.
>
> basically,
>
> struct MyPOD { double d1; ..... double d8; }
> and I currently have std::vector<MyPOD> with 1e6 elements.
> 100 of those would be == a particular instance of MyPOD invalid { +Inf,
> +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
>
> I need:
> 1. calculate the max element, and the min element, _while_ ignoring the
> 100 or so invalid entries (I don't know where they are), I can't however
> change the vector<MyPOD>. it's read only.
>
> 2. I need also to plot it (ie, plot all the elements that are not Invalid).
> The index of each element in vector is transformed and returns a sort of
> coordinate to plot a point
>
> So, I would have wrapped this vector into a class, and exposed similar
> interface to vector, but would have implemented an iterator that just skips
> the invallid entries:the iterator would skip the invalid entry until next
> valid one when incremented or decremented.
> My class would also be indexable with [] with this access not with the
> same order as vector's [] (it would skip the invalid entries, so slower)
>
> Perhaps the best choice is a boost range adapted view.
>
> Reading the docs....
>
> thanks very much
>
> MM
>

I have wrapped my vector in a class, and used boost range adaptor filtered.

I've implemented operator[](std::size_t i) as a simple

begin is the begin iterator to a filtered view on myvector |
boost::adapators::filtered( ... ) with a predicate that ignores my invalid
values.

std::advance(begin, i)
/// this advance on the view gives a performance equal to what a forward
only iterator (not random access iterator) would give.

This appears to be too slow in my case, I need a better order of magnitude
of such access,

I probably should have a solution where I store the indices of the invalid
values, and I implement
operator[](std::size_t i) with the help of those indices.

MM



Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net