Boost logo

Boost Users :

Subject: Re: [Boost-users] Boost.Functional/Hash
From: Robert Ramey (ramey_at_[hidden])
Date: 2010-09-02 18:15:02


Marsh Ray wrote:
> On 09/02/2010 04:06 AM, Andrej van der Zee wrote:
>> Hi,
>>
>> I have a question that I hope can be answered by using Boost. I have
>> the following datastructure:
>>
>> struct ip_info
>> {
>> const u_int8_t _ifile_index;
>> const unsigned long _src_ip;
>> const unsigned long _dst_ip;
>> const u_int16_t _id;
>> const u_int16_t _length;
>> const u_int16_t _src_port;
>> const u_int16_t _dst_port;
>> const u_int32_t _seq;
>> const u_int32_t _ack_seq;
>> };
>>
>> The object is identified by all its members. Now I want to count
>> duplicates. But I have MANY of these objects, I mean billions. I have
>> to reduce the memory footprint of my program, because it soon grows
>> over 4GB and boom! So I rather not keep all the objects in memory and
>> why would I, because I do not need them. I just need to count
>> duplicates.
>
> My, what a clear statement of a classic problem. Hmm, are you sure
> this isn't homework? :-)

I know I'm not supposed to do this but ....

Here is an easy solution.

a) create a file of all the objects. Divide the file size by the record
length to get number
of records.
b) sort them using the postman's sort evaluation version (available at
www.rrsd.com)
using the -u switch to eliminate duplicates.
c) calculate the output file size using the method in a) above
d) the difference between the two numbers is the number of duplicates.

On my 5 year old machine, sorting 1 GB takes about 1 minute. Since your
records are about 2 bytes long it would be 20 min/billion records. I doubt
you'd find a faster way.

Robert Ramey


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net