Subject: Re: [Boost-users] Boost.Functional/Hash
From: Marsh Ray (marsh_at_[hidden])
Date: 2010-09-02 12:14:39


On 09/02/2010 04:06 AM, Andrej van der Zee wrote:
> Hi,
>
> I have a question that I hope can be answered by using Boost. I have
> the following datastructure:
>
> struct ip_info
> {
>     const u_int8_t      _ifile_index;
>     const unsigned long _src_ip;
>     const unsigned long _dst_ip;
>     const u_int16_t     _id;
>     const u_int16_t     _length;
>     const u_int16_t     _src_port;
>     const u_int16_t     _dst_port;
>     const u_int32_t     _seq;
>     const u_int32_t     _ack_seq;
> };
>
> The object is identified by all its members. Now I want to count
> duplicates. But I have MANY of these objects, I mean billions. I have
> to reduce the memory footprint of my program, because it soon grows
> over 4GB and boom! So I'd rather not keep all the objects in memory,
> and why would I, since I do not need them. I just need to count
> duplicates.

My, what a clear statement of a classic problem. Hmm, are you sure this
isn't homework? :-)

> My idea was to hash such objects to smaller values and use that hash
> value as the key_type in a map.

Usually you'd want to base it on a std::vector and call it a hashtable;
that is, use the hash to pick a bucket rather than using it as a
std::map key.

> The value_type of the map would be an
> integer. This would be something like std::map<int32_t, u_int16_t> or
> something.
>
> Then I thought maybe use Boost.Functional/Hash, but are there any
> guarantees that the hash value is unique?

You said:
> The object is identified by all its members.

This strongly implies that there are no "forbidden" combinations of
values, which means that any guaranteed-unique hash value would have to
be at least as long as the object itself: there are more possible
objects than there are shorter hash values, so some pair must collide
(the pigeonhole principle).

Really, the whole benefit of using a hash value derives from your
willingness to tolerate multiple objects hashing to the same value.

> How unique are they?

Probably plenty good enough for data structure purposes.
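
If you want boost::hash to understand ip_info, the supported route is
the hash_value() / hash_combine() extension point from
Boost.Functional/Hash. A minimal sketch, assuming the BSD-style
u_int*_t typedefs come from <sys/types.h> and with the const dropped
from the members so the objects stay assignable:

#include <sys/types.h>              // u_int8_t, u_int16_t, u_int32_t
#include <cstddef>
#include <boost/functional/hash.hpp>

struct ip_info
{
    u_int8_t      _ifile_index;
    unsigned long _src_ip;
    unsigned long _dst_ip;
    u_int16_t     _id;
    u_int16_t     _length;
    u_int16_t     _src_port;
    u_int16_t     _dst_port;
    u_int32_t     _seq;
    u_int32_t     _ack_seq;
};

// boost::hash<ip_info> picks this up via argument-dependent lookup.
std::size_t hash_value(ip_info const& v)
{
    std::size_t seed = 0;
    boost::hash_combine(seed, v._ifile_index);
    boost::hash_combine(seed, v._src_ip);
    boost::hash_combine(seed, v._dst_ip);
    boost::hash_combine(seed, v._id);
    boost::hash_combine(seed, v._length);
    boost::hash_combine(seed, v._src_port);
    boost::hash_combine(seed, v._dst_port);
    boost::hash_combine(seed, v._seq);
    boost::hash_combine(seed, v._ack_seq);
    return seed;
}

With that in place, boost::hash<ip_info>()(obj) hands you a
std::size_t. Just don't mistake it for a unique key: on a 32-bit
size_t the birthday bound means collisions show up long before you get
anywhere near billions of distinct objects.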

> Are
> there better alternatives?

Probably not :-)

If you only have to do this once and it's only a few GB, try doing it
with an industrial strength database like PostgreSQL. This is not
exactly an elegant C++ solution, but it may very well get the job done
quicker.

Also, an open-source large-file sort utility (GNU sort, for instance,
handles files bigger than RAM) could preprocess the data in a
reasonable time and make the counting step a breeze, since duplicates
end up adjacent once the data is sorted.

Otherwise, if this will be an ongoing need and you want to write it in C++:

1. Allocate the largest possible bitset that can fit into memory
(std::bitset's size is fixed at compile time, so boost::dynamic_bitset
is a better fit here). This is where a 64-bit CPU and OS (even with the
same amount of RAM) really help.

2. Make a pass over the data, indexing the bitset by each object's
computed hash value modulo the count of bits. If the indexed bit is
already set, also set the same position in a second 'seen twice'
bitset; then set the indexed bit. Keep both bitsets and pass over the
data again, this time writing every object whose 'seen twice' bit is
set to a 'potential dups' file. (There's a rough sketch of this
filtering pass after step 7.)

3. Repeat (2) with a different hash function (perhaps just prefix each
object with some constant) using the previous 'potential dups' file as
input. Do this a few times until the file size stops shrinking. The size
of the resulting file gives an upper bound for the number of duplicate
values. If there aren't many dups, you may have a small data set now.

4a. If the upper bound suggests it's practical, just make a
conventional hashtable mapping the values to their counts. Note that
you do need to keep one of each unique value for comparison, because
you said "The object is identified by all its members".

4b. If the upper bound is too large to fit all at once in memory, then
make N > 1 passes. Calculate a hash value using yet another variant of
the hash function. On the first pass, process as in (4a) but only for
elements where hash(prefix + object) % N == 0. On the next pass do
those == 1, and so on through N - 1. After each pass, append the
objects with counts > 1 to a results file.

5. Pick selected results and confirm them with an additional pass over
the data.

6. Use boost in your implementation somewhere.

7. Light fuse and get away?
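
For reference, a rough, untested sketch of one filtering round (steps
1-3), assuming the ip_info / hash_value() sketch above. read_all and
write_dup are hypothetical stand-ins for however you stream the capture
data and the 'potential dups' file; read_all replays the whole data
set, calling the supplied callback once per record:

#include <cstddef>
#include <boost/dynamic_bitset.hpp>
#include <boost/functional/hash.hpp>

// One filtering round: any object whose (salted) hash bucket is hit
// more than once is written out as a potential duplicate.
// nbits is as large as RAM allows (step 1); salt changes per round (step 3).
template <class ReadAll, class WriteDup>
void filter_pass(ReadAll read_all, WriteDup write_dup,
                 std::size_t nbits, std::size_t salt)
{
    boost::dynamic_bitset<> seen(nbits);        // bucket hit at least once
    boost::dynamic_bitset<> seen_twice(nbits);  // bucket hit at least twice

    // First pass: mark the buckets.
    read_all([&](ip_info const& o) {
        std::size_t h = salt;
        boost::hash_combine(h, o);              // uses hash_value(ip_info)
        std::size_t i = h % nbits;
        if (seen.test(i))
            seen_twice.set(i);
        seen.set(i);
    });

    // Second pass: keep everything sharing a bucket with something else.
    read_all([&](ip_info const& o) {
        std::size_t h = salt;
        boost::hash_combine(h, o);
        if (seen_twice.test(h % nbits))
            write_dup(o);
    });
}

Step 4a then collapses to an ordinary counting table over whatever
survives the filtering, e.g. boost::unordered_map<ip_info, u_int32_t>
counts; ++counts[o]; (its default hasher is boost::hash, so it picks up
hash_value() automatically; you'd also need an operator== comparing all
the members, which is what keeping one of each unique value buys you).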

- Marsh

