
Boost Users :

From: Robert Ramey (ramey_at_[hidden])
Date: 2007-03-09 13:53:20


I would first consider something else.

It turns out that, for reasons I don't want to go into here,
std::string is designated as "primitive". That is, there is no
tracking (reference counting by the serialization library on
load), versioning, etc. for std::string objects. This is probably
a good choice, except that it makes std::string "special" in
comparison to other types. In general, the only other
types considered "primitive" are C++ data types that are
truly primitive.

So, in contrast to the default behavior for std collections,
if the SAME string is saved twice - an actual copy is
saved and restored. This might be what most people
expect from a datatype like std::string, but it might be
an issue in some applications. Note that by
"SAME string" I'm referring to the same datum - not
two different strings with the same contents. So, the
serialization could be the source of your issue if
you're saving the SAME string many times.
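
To make the "SAME string" distinction concrete, here is a toy sketch of
address-based tracking. It is not the serialization library's code, and
AddressTracker and tracking_demo are made-up names used only for
illustration. It assumes that "same datum" means "same address".

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model of address-based object tracking (hypothetical, not Boost code).
class AddressTracker {
public:
    // Returns true the first time this exact object (address) is seen.
    bool first_time(const std::string& s) {
        return seen_.insert(&s).second;
    }

private:
    std::set<const std::string*> seen_;
};

// Usage sketch: the same datum is recognized, equal contents are not.
inline void tracking_demo() {
    AddressTracker tracker;
    std::string a("same contents");
    std::string b("same contents");

    assert(tracker.first_time(a));   // a is new
    assert(!tracker.first_time(a));  // the SAME string again: a duplicate
    assert(tracker.first_time(b));   // equal contents, but a different datum
}
```

A tracker of this kind only helps when the very same object is saved
repeatedly. Two strings that merely compare equal look unrelated to it.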

A closer reading of your post suggests that the above
isn't what you're referring to. I left in the above just
to clarify my thinking on the issue.

It sounds to me that you're telling me that the gcc
standard library uses a reference-counted, "copy on write"
string implementation, so that if a and b are strings, the
operation a = b doesn't result in a duplication of the content.

Which would surprise me.
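
One way to check this on a given compiler, as a rough sketch: compare the
buffer pointers of two strings after an assignment. The helper names
shares_buffer and assignment_shares_storage are made up for this example,
and the answer depends entirely on the standard library in use.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true when both strings point at the same character buffer.
inline bool shares_buffer(const std::string& a, const std::string& b) {
    return a.data() == b.data();
}

// Reports whether plain assignment shares storage in this standard library.
inline bool assignment_shares_storage() {
    std::string b(100, 'x');  // long enough to avoid any small-string buffer
    std::string a;
    a = b;                    // a copy-on-write library may just bump a refcount here
    assert(a == b);           // the contents are equal either way
    return shares_buffer(a, b);
}
```

If assignment_shares_storage() returns true, the library shares the buffer
on assignment. If it returns false, each string owns its own copy and there
are no reference counts to lose in the first place.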

But assuming that's the case, serialization would "lose" the
reference counts when the strings are recreated without
using the "=" operator. If this is the case, here are a couple
of ideas to consider.

Define your own serialization for std::string and use it instead
of the one in the serialization library. This is probably a bad
idea, as it would attribute your special behavior to a standard
object and would make your archives and programs non-portable
and harder to support if you want to ask us for help.

Define your own string class derived from std::string. This
string class could be serialized using your own special sauce
without losing portability. This could be formulated as
a "serialization wrapper" as described in the manual, so that
your code would only have to use this "special string"
in the process of serialization and not throughout your program.
Look in the recent documentation and at the "is_wrapper" type trait
for more information.

So now the problem boils down to how you're going to capture
and restore the fact that these strings share underlying data.
At first one would think that just letting your wrapper class
use the default tracking behavior to eliminate duplicates would
solve your problem. But I don't think so. As I said above,
I don't think that you're serializing the SAME (see above)
string one million times. I think you're serializing a million
different strings which happen to contain the same data.

It seems to me that you'll have to delve into the implementation
of the string class you're using and gain access to the internals
of the implementation and figure out how to capture the
reference to the shared contents and serialize that.
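
Short of reaching into the string internals, the sharing could be
approximated at the application level, along the lines of the lookup table
in your post. The following is only a sketch under stated assumptions:
StringPool and its members are made-up names, nothing here is part of the
serialization library, and it requires that the loaded data structure holds
a shared handle rather than a plain std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical interning pool: equal contents map to one shared string.
class StringPool {
public:
    typedef std::shared_ptr<const std::string> Handle;

    // Returns the pooled string equal to s, creating it on first sight.
    Handle intern(const std::string& s) {
        auto it = pool_.find(s);
        if (it == pool_.end()) {
            it = pool_.emplace(s, std::make_shared<const std::string>(s)).first;
        }
        assert(*it->second == s);  // the handle always carries the requested contents
        return it->second;
    }

    std::size_t size() const { return pool_.size(); }

    // Drops the pool's own references; handles already given out stay valid.
    void clear() { pool_.clear(); }

private:
    std::unordered_map<std::string, Handle> pool_;
};
```

After loading, each string read from the archive would be passed through
intern() and the handle stored in place of the string. Note that this
simple version holds every distinct string twice (as key and as value)
until clear() is called, and that a hash table is used here only as one
possible alternative to std::map.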

Robert Ramey

Bill Lear wrote:
> We have a massive amount of data to serialize, on the order of several
> gigabytes. Lots of strings involved, maybe hundreds of millions.
>
> We discovered that the data structure in memory would bloat enormously
> when read back in from disk (say, from 2 gig to 3.1 gig). We think we
> have tracked this down to (gcc implementation) string reference counts
> not being "restored". I think a solution for us is to do something
> like the following:
>
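> #include <map>
> #include <string>
> #include <utility>
> using namespace std;
>
> static map<string, bool> string_map;
>
> template <class Archive>
> void read_string(Archive& ar, string& a_string) {  // archive passed by reference
>     string s;
>     ar >> s;  // read from disk
>     map<string, bool>::iterator i = string_map.find(s);
>
>     if (i == string_map.end()) {
>         // insert() returns pair<iterator, bool>; keep the iterator
>         i = string_map.insert(make_pair(s, true)).first;
>     }
>
>     a_string = i->first;  // shares the buffer only if string is refcounted
> }
>
> void destroy_map() { string_map.clear(); }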
>
> Then, when the data structures have all been read, invoke the
> destroy_map() function to clear the string_map object, thus decrementing
> all refcounts of strings by one.
>
> Has anyone else encountered this and found a solution?
>
> Also, if anyone has bright ideas on a better data structure than
> std::map to use for storing hundreds of millions of strings at once
> for the above purpose, that also might be nice.
>
> Thanks.
>
>
> Bill

