
Subject: Re: [Boost-users] Serialization cumulatively.
From: Steven Clark (steven.clark_at_[hidden])
Date: 2015-03-12 10:58:06


There are probably some constraints you didn't mention. Here are some ideas based on a few different guesses.

* At 80 bytes per line, 3000 files x 55,000 lines works out to roughly 13 GB of data. With a moderately beefy computer you can hold it all in memory.

* You can store the intermediate results unserialized, just dumping your structs into files, and only serialize when you're finished (see the first sketch after this list). Or, keep all your intermediate results in memory until you're finished.

* Depending on what you're doing, using an actual database to store your intermediate results might improve performance.

* Reorganize your algorithm so it computes the final results for a file in one pass. Perhaps you can read each file, store some information in memory, then write results for each file.

* Store the intermediate results for all 3000 files in one file. Mmap the intermediate results file; this is another variation of the suggestion not to serialize intermediate results.

* Fix the program that reads the serialized files so that it can read an arbitrary number of serialized records rather than just one. I'm sure this can be done: slurp in a serialized record, check whether you're at the end of the file, and if not, repeat (see the second sketch after this list).
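
In case it helps, here is a minimal sketch of the "dump raw structs" idea. The IntermediateRec layout and the function names are just placeholders for illustration, not anything from your code; the only requirement is that the record type is fixed-size and trivially copyable.

#include <cstdio>
#include <vector>

// Illustrative fixed-size, trivially copyable record; swap in your own fields.
struct IntermediateRec {
    char symbol[64];
    long line;
    int  flags;
};

// Append one pass's worth of records as raw bytes.
void dump_pass(std::FILE* fp, const std::vector<IntermediateRec>& recs)
{
    std::fwrite(recs.data(), sizeof(IntermediateRec), recs.size(), fp);
}

// Read every record back; no count field is needed, fread just stops at EOF.
std::vector<IntermediateRec> load_all(const char* path)
{
    std::vector<IntermediateRec> out;
    if (std::FILE* fp = std::fopen(path, "rb")) {
        IntermediateRec r;
        while (std::fread(&r, sizeof r, 1, fp) == 1)
            out.push_back(r);
        std::fclose(fp);
    }
    return out;
}

Because the records are plain bytes at fixed offsets, the same file can also be mmap'ed and treated as an array, which covers the mmap variant above.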
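
And here is a rough sketch of that last suggestion, reading back however many records were appended. It assumes each pass appends its own self-contained text archive; the Record type and the helper names are made up for the example.

#include <fstream>
#include <istream>
#include <string>
#include <vector>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/string.hpp>

struct Record {
    std::string key;
    int value = 0;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & key & value;
    }
};

// One pass appends its record as a self-contained archive.
void append_record(const std::string& path, const Record& r)
{
    std::ofstream ofs(path, std::ios::app);
    {
        boost::archive::text_oarchive oa(ofs);
        oa << r;
    }            // make sure the archive is flushed...
    ofs << '\n'; // ...then keep consecutive archives cleanly separated
}

// The reader keeps constructing input archives until it runs out of file.
std::vector<Record> read_all(const std::string& path)
{
    std::vector<Record> out;
    std::ifstream ifs(path);
    while (ifs >> std::ws && ifs.peek() != std::char_traits<char>::eof()) {
        boost::archive::text_iarchive ia(ifs);
        Record r;
        ia >> r;
        out.push_back(r);
    }
    return out;
}

With something like that in place, each pass only ever appends; nothing has to be re-read and re-written.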

If none of these ideas are useful, at least they should help point out what other constraints you have that were not evident in your first message.

Steven J. Clark
VGo Communications

-----Original Message-----
From: Boost-users [mailto:boost-users-bounces_at_[hidden]] On Behalf Of Tony Camuso
Sent: Thursday, March 12, 2015 9:09 AM
To: boost-users_at_[hidden]
Subject: [Boost-users] Serialization cumulatively.

I'm trying to serialize the data from multiple passes of an app over various files. Each pass generates data that must be serialized.
If I simply append each serialization to the file, only the first instance ever gets deserialized, since the record count read back is the one that was written for that first instance.

What I'm doing is deserializing on each new pass, deleting the original file, and then serializing everything again with the new information.

If there were only a few files to process, this would not be a problem. However there are thousands of files.

Additionally, on each new pass I check whether a certain type of record has already been saved, so every pass has to look things up in a deeper and deeper database.

Currently, it's taking almost an hour to process about 3000 files, with an average of 55,000 lines per file. It is a huge amount of data.

However, I'm looking for a way to reduce the length of time it takes to do this processing.

Does anybody have a better idea than to cycle through the serialize-deserialize-lookup-serialize sequence for each file?
_______________________________________________
Boost-users mailing list
Boost-users_at_[hidden]
http://lists.boost.org/mailman/listinfo.cgi/boost-users

