
From: Bill Lear (rael_at_[hidden])
Date: 2007-03-09 09:22:07


We have a massive amount of data to serialize, on the order of several
gigabytes, involving perhaps hundreds of millions of strings.

We discovered that the in-memory data structure bloats enormously when
read back in from disk (say, from 2 gig to 3.1 gig). We think we have
tracked this down to string reference counts (in gcc's copy-on-write
implementation) not being "restored": strings that shared a single
buffer before serialization each come back with their own copy. I think
a solution for us is to do something like the following:

#include <boost/serialization/string.hpp> // needed for ar >> std::string
#include <map>
#include <string>
#include <utility>

static std::map<std::string, bool> string_map;

template <class Archive>
void read_string(Archive& ar, std::string& a_string) {
    std::string s;
    ar >> s; // read from disk

    std::map<std::string, bool>::iterator i = string_map.find(s);
    if (i == string_map.end()) {
        // insert() returns a pair<iterator, bool>; keep the iterator
        i = string_map.insert(std::make_pair(s, true)).first;
    }

    // assign from the map's key so identical strings share one buffer
    a_string = i->first;
}

void destroy_map() { string_map.clear(); }

Then, when the data structures have all been read, invoke destroy_map()
to clear the string_map object, thus decrementing the refcount of every
shared string by one.

Has anyone else encountered this and found a solution?

Also, if anyone has bright ideas on a better data structure than
std::map for holding hundreds of millions of strings at once for the
above purpose, I'd like to hear them.
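For reference, an equivalent version using std::set<std::string> (which
at least avoids storing an unused bool per entry) would look something
like this untested sketch; read_string_pooled and string_pool are just
illustrative names:

#include <set>
#include <string>

static std::set<std::string> string_pool;

template <class Archive>
void read_string_pooled(Archive& ar, std::string& a_string) {
    std::string s;
    ar >> s; // read from disk, as before

    // insert() is a no-op if an equal string is already in the pool;
    // either way .first points at the pooled copy
    a_string = *string_pool.insert(s).first;
}

void destroy_pool() { string_pool.clear(); }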

Thanks.

Bill

