
Subject: Re: [Boost-users] Serialization cumulatively.
From: Tony Camuso (tcamuso_at_[hidden])
Date: 2015-03-13 05:04:59


On 03/12/2015 10:58 AM, Steven Clark wrote:
> There are probably some constraints you didn't mention.

Of course. :)

> Here are some ideas based on various different guesses.

And thank you so much for taking the time to respond to my post.

> * At 80 bytes per line, that's a total of about 15 Gb of data. With
> a moderately beefy computer you can hold it all in memory.
>
> * You can store the intermediate results unserialized, just dumping
> your structs into files. Only serialize when you're finished. Or,

True that, but one of the details I omitted is that this app is linked
with libsparse, which is like lint on steroids. That tool parses
preprocessed files and builds an in-memory tree of all the symbols in
each file. My code walks this tree to create a database of info germane
to our purposes, which of course uses still more memory. With about
3000 files to process, there isn't enough memory on the average
workstation to contain it all at once.

When I tried to do this all in memory, even a big kahuna machine
with 32 GB of memory and 48 cores tanked after about the 100th
file.

> * Depending on what you're doing, using an actual database to store
> your intermediate results might improve performance.

Tried that. The performance of Boost serialization trumps that of a
DBMS. :)

> * Reorganize your algorithm so it computes the final results for a
> file in one pass. Perhaps you can read each file, store some
> information in memory, then write results for each file.
>
> * Store the intermediate results for all 3000 files in one file.
> Mmap the intermediate results file; this is another variation of the
> suggestion not to serialize intermediate results.
>
> * Fix the program that reads the serialized files, so that it can
> read an arbitrary number of serialized records rather than just one.
> I'm sure this can be done - slurp in a serialized record, see if
> you're at the end of file, if not then repeat.

These steps offer the most promise.

The code already reads all the serialized records into memory, into a
vector, with a single deserialization call.
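
For concreteness, here's a minimal sketch of that single-call load. The
record type, member names, and file name are stand-ins, not my real
structs:

    #include <fstream>
    #include <string>
    #include <vector>
    #include <boost/archive/text_iarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>

    struct symbol_rec {                     // hypothetical record type
        std::string name;
        std::vector<std::string> edges;     // connections to other symbols

        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/) {
            ar & name & edges;
        }
    };

    std::vector<symbol_rec> load_all(const std::string& path)
    {
        std::vector<symbol_rec> recs;
        std::ifstream ifs(path);
        boost::archive::text_iarchive ia(ifs);
        ia >> recs;                         // one call restores the whole vector
        return recs;
    }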

The fault lies in the algorithm I am using to manage duplicate
symbols when I encounter them.

What I do for every symbol (a rough code sketch follows this list) is ...

. create a new node (vertex)
. search the existing list for duplicates
. if the symbol is a duplicate, add its connections (edges) to the
   pre-existing node and delete the new node.
. next
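
Roughly, in code (node here is a placeholder for my real vertex type):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct node {                           // stand-in for my vertex type
        std::string name;
        std::vector<std::string> edges;     // connections
    };

    void add_symbol(std::vector<node>& graph, const node& fresh)
    {
        // search the existing list for a duplicate of the new node
        auto dup = std::find_if(graph.begin(), graph.end(),
                [&](const node& n) { return n.name == fresh.name; });

        if (dup != graph.end()) {
            // duplicate: fold the new node's edges into the pre-existing one
            dup->edges.insert(dup->edges.end(),
                              fresh.edges.begin(), fresh.edges.end());
            return;                         // the new node is simply dropped
        }

        graph.push_back(fresh);             // first sighting: keep it
    }

Every insertion pays for a linear scan of everything collected so far,
so the total work grows quadratically with the number of symbols, which
is why the later files crawl.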

Performance drops from about 3 files per second to less than one per
second by the end. For the 3000+ files, the run takes more than 50
minutes on an 8-core machine with 16 GB of memory.

To speed things up, I've created a nodes-only list, which shrinks the
vector to be searched by a factor of 4. I haven't got this working yet,
so I can't say what the performance gain will be.
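
The idea, sketched here with placeholder names, is a slim parallel list
carrying only the symbol name plus an index into the main vector, so
the duplicate search touches far less data than scanning the full nodes
with their edge lists:

    #include <string>
    #include <vector>

    struct node_ref {                // nodes-only entry: no edge payload
        std::string name;            // symbol name
        std::size_t index;           // position of the full node in the main vector
    };

    // Return the main-vector index of an existing node with this name,
    // or (size_t)-1 if the symbol hasn't been seen yet.
    std::size_t find_node(const std::vector<node_ref>& refs,
                          const std::string& name)
    {
        for (std::size_t i = 0; i < refs.size(); ++i)
            if (refs[i].name == name)
                return refs[i].index;
        return static_cast<std::size_t>(-1);
    }

It's still a linear search, just over less data, so I expect the gain
to be roughly proportional to the size reduction.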

> If none of these ideas are useful, at least they should help point
> out what other constraints you have, that were not evident in your
> first message.
>
> Steven J. Clark VGo Communications

Many thanks, Steven. I realize how busy everybody is, and I really
appreciate the thoughtful and valuable input.

Regards,
Tony

