Boost logo

Boost Users :

Subject: Re: [Boost-users] container with large number of elements and a small number of invalide elements
From: MM (finjulhich_at_[hidden])
Date: 2013-11-05 13:29:42


On 5 November 2013 18:09, Vincent N. Virgilio <virgilio_at_[hidden]> wrote:

> On Tue, Nov 5, 2013 at 12:43 PM, MM <finjulhich_at_[hidden]> wrote:
>
>> On 4 November 2013 18:18, MM <finjulhich_at_[hidden]> wrote:
>>
>>> On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn_at_[hidden]> wrote:
>>>
>>>> On 11/1/2013 7:24 PM, Travis Gockel wrote:
>>>>
>>>>> On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn_at_[hidden]
>>>>> <mailto:Jeffrey.Flinn_at_[hidden]>> wrote:
>>>>>
>>>>> On 11/1/2013 1:28 PM, MM wrote:
>>>>>
>>>>> On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn_at_[hidden]
>>>>> <mailto:Jeffrey.Flinn_at_[hidden]>
>>>>> <mailto:Jeffrey.Flinn_at_gmail.__com
>>>>>
>>>>> <mailto:Jeffrey.Flinn_at_[hidden]>>> wrote:
>>>>> On 10/31/2013 1:14 PM, MM wrote:
>>>>> On 31 October 2013 16:10, Klaim - Joël Lamotte
>>>>> <mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>
>>>>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>>
>>>>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>
>>>>> <mailto:mjklaim_at_[hidden] <mailto:mjklaim_at_[hidden]>>>> wrote:
>>>>>
>>>>> What would "valid"/"invalid" means here?
>>>>> If it's "exist" or "don't exist", then maybe
>>>>> using
>>>>> boost::optional<MyPod> would be enough.
>>>>>
>>>>> That's right. How about sizeof(
>>>>> boost::optional<MyPod>
>>>>> ) vs sizeof(
>>>>> MyPOD ) ?
>>>>> Also, what sort of iterator can iterate over just the
>>>>> valid
>>>>> values? or
>>>>> how to implement one?
>>>>> The point here is that there's a small number of
>>>>> elements that are
>>>>> invalid, their indices in the container are random,
>>>>>
>>>>> So why not keep the invalid items from getting into the
>>>>> container in
>>>>> the first place?
>>>>>
>>>>> Jeff
>>>>>
>>>>> because they are processed at a later stage, their indices in
>>>>> the
>>>>> container are relevant,
>>>>>
>>>>>
>>>>> I'd need a better description of just what defines and item to be
>>>>> valid/invalid. And just what this processed at a later stage means.
>>>>>
>>>>> If you've some functions:
>>>>>
>>>>> bool valid(const MyPod& ) { return ...; }
>>>>> bool invalid(const MyPod& pod) { return !valid(pod); }
>>>>>
>>>>> , and a pre-existing container then you can get an adapted view
>>>>> using boost::range::adaptors::__filtered like:
>>>>>
>>>>>
>>>>> typedef std::vector<MyPod> MyPods;
>>>>>
>>>>> MyPods all(...);
>>>>> MyPods bad;
>>>>>
>>>>> boost::copy(all | filtered(invalid), std::back_inserter(bad));
>>>>>
>>>>> boost::foreach(all | filtered(valid), someFunction);
>>>>>
>>>>> It all depends on the context of where this is used. Another
>>>>> approach uses the boost icl (interval container library). One
>>>>> example is:
>>>>>
>>>>> http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_
>>>>> _boost_icl/projects.html#boost___icl.projects.large_bitset
>>>>>
>>>>> <http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/
>>>>> html/boost_icl/projects.html#boost_icl.projects.large_bitset>
>>>>>
>>>>> I've used a similar approach to just store the "valid" bases in a
>>>>> dna sequences and keeping track of gaps with an interval container.
>>>>> The advantage is avoiding the filtering cost when the application
>>>>> is
>>>>> primarily interested in the valid items only.
>>>>>
>>>>>
>>>>> I would recommend checking out Google sparsehash -- specifically,
>>>>> sparse_hash_set<T>:
>>>>> http://code.google.com/p/sparsehash/source/browse/
>>>>> trunk/src/sparsehash/sparse_hash_set
>>>>>
>>>>> The biggest difference between that and something like an
>>>>> std::vector<boost::optional<T>> is google::sparse_hash_set will
>>>>> re-order
>>>>> values on you (the nature of hash tables). It also forces you to have a
>>>>> "deleted key" which exists in the realm of possible T values. Based on
>>>>> impression that I get from your original email, this fits with your
>>>>> idea
>>>>> of an "invalid" T.
>>>>>
>>>>> That said, it is a ridiculously low-overhead container (on the order of
>>>>> ~2 bits/entry vs 4 bytes/entry with a boost::optional).
>>>>>
>>>>
>>>> That's 2x10^6 bits for the OP's case, where-as the ICL approach is
>>>> 64bits per interval. In the OP's case that would be 64x100(max number of
>>>> invalid items assuming all invalid separated by atleast 1 valid item) about
>>>> 3 orders of magnitude less.
>>>>
>>>>
>>>> Jeff
>>>>
>>>
>>> Thanks for the various answers. I'm reading about ICL, and sparsehash to
>>> learn more.
>>>
>>> basically,
>>>
>>> struct MyPOD { double d1; ..... double d8; }
>>> and I currently have std::vector<MyPOD> with 1e6 elements.
>>> 100 of those would be == a particular instance of MyPOD invalid {
>>> +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
>>>
>>> I need:
>>> 1. calculate the max element, and the min element, _while_ ignoring the
>>> 100 or so invalid entries (I don't know where they are), I can't however
>>> change the vector<MyPOD>. it's read only.
>>>
>>> 2. I need also to plot it (ie, plot all the elements that are not
>>> Invalid).
>>> The index of each element in vector is transformed and returns a sort of
>>> coordinate to plot a point
>>>
>>> So, I would have wrapped this vector into a class, and exposed similar
>>> interface to vector, but would have implemented an iterator that just skips
>>> the invallid entries:the iterator would skip the invalid entry until next
>>> valid one when incremented or decremented.
>>> My class would also be indexable with [] with this access not with the
>>> same order as vector's [] (it would skip the invalid entries, so slower)
>>>
>>> Perhaps the best choice is a boost range adapted view.
>>>
>>> Reading the docs....
>>>
>>> thanks very much
>>>
>>> MM
>>>
>>
>> I have wrapped my vector in a class, and used boost range adaptor
>> filtered.
>>
>> I've implemented operator[](std::size_t i) as a simple
>>
>> begin is the begin iterator to a filtered view on myvector |
>> boost::adapators::filtered( ... ) with a predicate that ignores my invalid
>> values.
>>
>> std::advance(begin, i)
>> /// this advance on the view gives a performance equal to what a forward
>> only iterator (not random access iterator) would give.
>>
>> This appears to be too slow in my case, I need a better order of
>> magnitude of such access,
>>
>> I probably should have a solution where I store the indices of the
>> invalid values, and I implement
>> operator[](std::size_t i) with the help of those indices.
>>
>> MM
>>
>> Is std::stable_partition too slow? (Sorry if answered earlier in the
>> thread.)
>>
>> http://www.sgi.com/tech/stl/stable_partition.html
>>
>> Vince
>>
>> I am unable to copy or modify my original vector, I can only browse over
it, iterate over it in different ways.
stable_partition would move around the elements if I'm not mistaken

MM



Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net