Boost logo

Boost Users :

Subject: Re: [Boost-users] container with large number of elements and a small number of invalide elements
From: Jeff Flinn (Jeffrey.Flinn_at_[hidden])
Date: 2013-11-05 13:30:05


On 11/5/2013 12:43 PM, MM wrote:
> On 4 November 2013 18:18, MM <finjulhich_at_[hidden]
> <mailto:finjulhich_at_[hidden]>> wrote:
>
> On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn_at_[hidden]
> <mailto:Jeffrey.Flinn_at_[hidden]>> wrote:
>
> On 11/1/2013 7:24 PM, Travis Gockel wrote:
>
> On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn
> <Jeffrey.Flinn_at_[hidden] <mailto:Jeffrey.Flinn_at_[hidden]>
> <mailto:Jeffrey.Flinn_at_gmail.__com
> <mailto:Jeffrey.Flinn_at_[hidden]>>> wrote:
>
> On 11/1/2013 1:28 PM, MM wrote:
>
> On 1 November 2013 12:37, Jeff Flinn
> <Jeffrey.Flinn_at_[hidden] <mailto:Jeffrey.Flinn_at_[hidden]>
> <mailto:Jeffrey.Flinn_at_gmail.__com
> <mailto:Jeffrey.Flinn_at_[hidden]>>
> <mailto:Jeffrey.Flinn_at_gmail.
> <mailto:Jeffrey.Flinn_at_gmail.>____com
>
> <mailto:Jeffrey.Flinn_at_gmail.__com
> <mailto:Jeffrey.Flinn_at_[hidden]>>>> wrote:
> On 10/31/2013 1:14 PM, MM wrote:
> On 31 October 2013 16:10, Klaim - Joël Lamotte
> <mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]> <mailto:mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]>>
> <mailto:mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]> <mailto:mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]>>>
> <mailto:mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]> <mailto:mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]>>
> <mailto:mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]> <mailto:mjklaim_at_[hidden]
> <mailto:mjklaim_at_[hidden]>>>>> wrote:
>
> What would "valid"/"invalid" means here?
> If it's "exist" or "don't exist",
> then maybe using
> boost::optional<MyPod> would be enough.
>
> That's right. How about sizeof(
> boost::optional<MyPod>
> ) vs sizeof(
> MyPOD ) ?
> Also, what sort of iterator can iterate
> over just the valid
> values? or
> how to implement one?
> The point here is that there's a small
> number of
> elements that are
> invalid, their indices in the container
> are random,
>
> So why not keep the invalid items from getting
> into the
> container in
> the first place?
>
> Jeff
>
> because they are processed at a later stage, their
> indices in the
> container are relevant,
>
>
> I'd need a better description of just what defines and
> item to be
> valid/invalid. And just what this processed at a later
> stage means.
>
> If you've some functions:
>
> bool valid(const MyPod& ) { return ...; }
> bool invalid(const MyPod& pod) { return !valid(pod); }
>
> , and a pre-existing container then you can get an
> adapted view
> using boost::range::adaptors::____filtered like:
>
>
> typedef std::vector<MyPod> MyPods;
>
> MyPods all(...);
> MyPods bad;
>
> boost::copy(all | filtered(invalid),
> std::back_inserter(bad));
>
> boost::foreach(all | filtered(valid), someFunction);
>
> It all depends on the context of where this is used.
> Another
> approach uses the boost icl (interval container
> library). One
> example is:
>
> http://www.boost.org/doc/libs/____1_55_0b1/libs/icl/doc/html/____boost_icl/projects.html#__boost___icl.projects.large___bitset
> <http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/__boost_icl/projects.html#boost___icl.projects.large_bitset>
>
>
> <http://www.boost.org/doc/__libs/1_55_0b1/libs/icl/doc/__html/boost_icl/projects.html#__boost_icl.projects.large___bitset
> <http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/html/boost_icl/projects.html#boost_icl.projects.large_bitset>>
>
> I've used a similar approach to just store the "valid"
> bases in a
> dna sequences and keeping track of gaps with an
> interval container.
> The advantage is avoiding the filtering cost when the
> application is
> primarily interested in the valid items only.
>
>
> I would recommend checking out Google sparsehash --
> specifically,
> sparse_hash_set<T>:
> http://code.google.com/p/__sparsehash/source/browse/__trunk/src/sparsehash/sparse___hash_set
> <http://code.google.com/p/sparsehash/source/browse/trunk/src/sparsehash/sparse_hash_set>
>
> The biggest difference between that and something like an
> std::vector<boost::optional<T>__> is google::sparse_hash_set
> will re-order
> values on you (the nature of hash tables). It also forces
> you to have a
> "deleted key" which exists in the realm of possible T
> values. Based on
> impression that I get from your original email, this fits
> with your idea
> of an "invalid" T.
>
> That said, it is a ridiculously low-overhead container (on
> the order of
> ~2 bits/entry vs 4 bytes/entry with a boost::optional).
>
>
> That's 2x10^6 bits for the OP's case, where-as the ICL approach
> is 64bits per interval. In the OP's case that would be
> 64x100(max number of invalid items assuming all invalid
> separated by atleast 1 valid item) about 3 orders of magnitude less.
>
>
> Jeff
>
>
> Thanks for the various answers. I'm reading about ICL, and
> sparsehash to learn more.
>
> basically,
>
> struct MyPOD { double d1; ..... double d8; }
> and I currently have std::vector<MyPOD> with 1e6 elements.
> 100 of those would be == a particular instance of MyPOD invalid {
> +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
>
> I need:
> 1. calculate the max element, and the min element, _while_ ignoring
> the 100 or so invalid entries (I don't know where they are), I can't
> however change the vector<MyPOD>. it's read only.
>
> 2. I need also to plot it (ie, plot all the elements that are not
> Invalid).
> The index of each element in vector is transformed and returns a
> sort of coordinate to plot a point
>
> So, I would have wrapped this vector into a class, and exposed
> similar interface to vector, but would have implemented an iterator
> that just skips the invallid entries:the iterator would skip the
> invalid entry until next valid one when incremented or decremented.
> My class would also be indexable with [] with this access not with
> the same order as vector's [] (it would skip the invalid entries, so
> slower)
>
> Perhaps the best choice is a boost range adapted view.
>
> Reading the docs....
>
> thanks very much
>
> MM
>
>
> I have wrapped my vector in a class, and used boost range adaptor filtered.
>
> I've implemented operator[](std::size_t i) as a simple
>
> begin is the begin iterator to a filtered view on myvector |
> boost::adapators::filtered( ... ) with a predicate that ignores my
> invalid values.
>
> std::advance(begin, i)
> /// this advance on the view gives a performance equal to what a forward
> only iterator (not random access iterator) would give.
>
> This appears to be too slow in my case, I need a better order of
> magnitude of such access,
>
> I probably should have a solution where I store the indices of the
> invalid values, and I implement
> operator[](std::size_t i) with the help of those indices.

Or modify your algorithm to not require random access. ;-)

What are you wanting to do with the sequence(s)?

Jeff


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net