From: Dave Harris (brangdon_at_[hidden])
Date: 2002-12-11 13:11:36


In-Reply-To: <uu1hmc37y.fsf_at_[hidden]>
On Mon, 09 Dec 2002 16:05:53 -0500 David Abrahams
(dave_at_[hidden]) wrote:
> This process [...] should also reflect a reluctance to
> begin writing code too early.

We've spent quite a lot of time in discussion already. I think at least
some of these issues need running code to resolve them.

> 1. Agreement on terms. In particular, I strongly suggest beginning
> with the definitions of serialization and persistence outlined by
> Augustus Saunders in
> http://lists.boost.org/MailArchives/boost/msg39598.php. I realize
> that Robert didn't like those definitions, but they resonated for
> most people (including me), and seem to provide an excellent
> starting point.

For what it's worth, I didn't like those definitions. In my view
serialisation is the right name for what the submitted library did.
Persistence is just serialisation to a persistent medium. Persistence is a
property of media rather than formats. It all but comes for free once you
have decent serialisation.

But I agree there is a need to separate plain serialisation from data
conversion/formatting/filtering. With XML, for example, the intent is not
so much to save and load the data (there are more efficient ways of doing
that), but to work on the data in its XML form with XML tools.

It is maybe worth identifying another possible purpose, that of reporting
or logging, where the serialised output is meant to be read by a human.
For this it may not be necessary to have input archives, just output
archives.

For the sake of concreteness, I suggest "serialisation", "conversion" and
"logging".

> 2. Careful description of scope. Answer questions like:
> * Is this a persistence or serialization library?

If it can cover all bases reasonably well, that would be nice. I'd love it
if implementing persistence gave me a readable debugging dump for free.

However, we need to see whether good support for, say, XML, can be done
without compromising the simpler requirements. I think we need some
running code.

If good support can't be got, then we need to decide whether poor support
should be included. There may be a danger of leading users into a dead
end.

It seems to me there is a natural divide between formats where the meaning
of a field is determined by its order in the data stream, and formats
where the meaning of fields is determined by identifying tags.

Sometimes when I think about XML, I feel I really want to build a
/reflection/ library, a way for UDTs to describe their instance variables.
XML could work by publishing a map of fields: their names, types and their
byte offsets within their object. A single such description could be used
by both input and output routines. It'd support tagged fields being read
in random order - what I mean by "good" XML support.
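
As a rough sketch of the field-map idea (not code from the submitted library), the fragment below describes a type once and uses that one description for both output and tagged input in any order. FieldInfo, Point, point_fields, save_tagged and load_tagged are all hypothetical names, every field is assumed to be an int so that the type column can be left out, and C++11 is assumed.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical field descriptor: a tag name plus the byte offset of the
// field within its object. Every field is an int in this sketch; a fuller
// map would also record the field's type.
struct FieldInfo {
    std::string name;
    std::size_t offset;
};

struct Point {
    int x;
    int y;
};

// A single description of Point, shared by the input and output routines.
inline const std::vector<FieldInfo>& point_fields() {
    static const std::vector<FieldInfo> fields = {
        {"x", offsetof(Point, x)},
        {"y", offsetof(Point, y)},
    };
    return fields;
}

// Stand-in for a tagged stream such as XML: a sequence of (tag, value).
using Tagged = std::vector<std::pair<std::string, int>>;

// Output: walk the field map and emit one (tag, value) pair per field.
template <class T>
Tagged save_tagged(const T& obj, const std::vector<FieldInfo>& fields) {
    Tagged out;
    const char* base = reinterpret_cast<const char*>(&obj);
    for (const FieldInfo& f : fields) {
        int value;
        std::memcpy(&value, base + f.offset, sizeof value);
        out.push_back(std::make_pair(f.name, value));
    }
    return out;
}

// Input: look each incoming tag up in the field map, so the order of the
// tags in the stream does not matter. Unknown tags are skipped.
template <class T>
void load_tagged(T& obj, const std::vector<FieldInfo>& fields,
                 const Tagged& in) {
    char* base = reinterpret_cast<char*>(&obj);
    for (const auto& item : in) {
        for (const FieldInfo& f : fields) {
            if (f.name == item.first) {
                std::memcpy(base + f.offset, &item.second, sizeof(int));
            }
        }
    }
}
```

Loading the pairs ("y", 7) then ("x", 5) into a Point sets x to 5 and y to 7, which is the order independence meant by "good" support.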

"Poor" XML support would mean providing tag names in the output, but more
or less ignoring them on input, and using the order of fields to determine
their meaning.

I suspect that even poor support would be worthwhile. The "dead-end" fear
is a chimera - people can start with limited XML, and then expand to full
XML later and still be able to read their data. Also, I think that
providing tag information will enable a wealth of new applications.

> * Is it important to be able to plug in arbitrary archive
> formats?

The fewer restrictions on archive format the better. But that is just a
platitude. I think we need running code to see what can and can't be done
in practice.

> * Is it important to be able to use the same UDT serialization
> code to write several different archive formats?

I think so. However, I think a subsetting approach would be OK. Some
formats need more UDT information than others. Formats which don't need
the info should quietly ignore it, and users should not be required to
provide information which they know will be ignored.

For example, XML really needs the UDT to provide tags or field names.
Users who know they don't want XML needn't bother.
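
One way the subsetting approach could look, purely as an illustration: the same UDT code drives two hypothetical archives, one positional and one tagged. PositionalOut, TaggedOut, Size and save are invented for this sketch and are not the interface of the submitted library.

```cpp
#include <sstream>
#include <string>

// Hypothetical positional archive. It accepts a field name but quietly
// ignores it, and also accepts fields with no name at all.
class PositionalOut {
public:
    void field(const char* /*name*/, int value) { os_ << value << ' '; }
    void field(int value) { os_ << value << ' '; }
    std::string str() const { return os_.str(); }
private:
    std::ostringstream os_;
};

// Hypothetical tagged archive. It needs a name for every field, so it
// offers no unnamed overload.
class TaggedOut {
public:
    void field(const char* name, int value) {
        os_ << '<' << name << '>' << value << "</" << name << '>';
    }
    std::string str() const { return os_.str(); }
private:
    std::ostringstream os_;
};

struct Size {
    int width;
    int height;
};

// One piece of UDT code that works with either archive.
template <class Archive>
void save(Archive& ar, const Size& s) {
    ar.field("width", s.width);
    ar.field("height", s.height);
}
```

A user who knows they will never want the tagged format could call ar.field(s.width) instead. That compiles against PositionalOut only, so the decision to give up XML is visible at compile time rather than failing silently.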

> * What kinds of applications are we intending to serve?
> * What kinds of applications are we explicitly NOT intending to
> serve?

Do we need a list? Here's a starting point:

Serialisation:
o Deep copy by writing to a RAM buffer and then reading again
   within the same address space.

o Sending data structures down pipes, through serial ports,
   in TCP/IP packets, and through OLE/COM/XXX links.

o Sending data structures through email, HTTP "POST" requests,
   and similar dumb channels.
   

Persistence:
o Storing data structures to disk files, CDs, DVDs etc and
   loading them back in later.

o Storing data structures on paper for later scanning, faxing etc.

o Storing and loading configuration data to/from the Windows
   Registry.

Conversion:
o To XML, for post-processing by other apps.

o From XML, for accepting data prepared by other apps.

o Reading XML from hand-crafted strings, as a way of setting up
   a test fixture during unit testing.

o Comparing XML output with hand-crafted strings as a way of
   verifying the success of a unit test.

Logging:
o Text for general-purpose readable debug dumps, during
   development.

o Text for error logs, event logs etc, which may be produced
   on a customer site and eg mailed to the programmers.

o Text for simple report generation, eg for programmer utilities
   where detailed cosmetic formatting isn't needed.

I hope that all the above could be accommodated without much trouble.
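
To make the first item in the list concrete, here is a minimal sketch of deep copy through a RAM buffer, using a std::stringstream as the buffer. Document and the free functions save, load and deep_copy are hypothetical, and the length-prefixed text layout is just one possible choice.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct Document {
    std::string title;
    std::vector<int> values;
};

// Write the title with a length prefix so that it may contain spaces.
void save(std::ostream& os, const Document& d) {
    os << d.title.size() << ' ' << d.title << ' ' << d.values.size();
    for (int v : d.values) os << ' ' << v;
}

void load(std::istream& is, Document& d) {
    std::size_t length = 0;
    is >> length;
    is.get();  // skip the single separator after the length
    d.title.resize(length);
    is.read(&d.title[0], static_cast<std::streamsize>(length));
    std::size_t count = 0;
    is >> count;
    d.values.resize(count);
    for (int& v : d.values) is >> v;
}

// Deep copy: serialise into memory, then read back within the same
// address space.
Document deep_copy(const Document& d) {
    std::stringstream buffer;
    save(buffer, d);
    Document copy;
    load(buffer, copy);
    return copy;
}
```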

> [...] a variable-length format. This [...] entirely obviates
> the need a text archive format

Actually a text format could still be useful if it was "email safe". Some
gateways strip out certain control characters, will not pass characters
with the high bit set, have restrictions on line-length and total message
size, etc. We could send a binary format through a "uuencode" filter, but
a text format which was natively safe would be neater (and probably more
efficient). Eg people may want to embed serialised data into web pages.
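
For comparison, the filter route could be as simple as the hex encoder sketched below, which stands in for uuencode here. It emits only the characters 0-9 and a-f and wraps lines at a fixed width, so it meets the gateway restrictions listed above at the cost of doubling the size. The function names and the default width of 72 are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>

// Encode arbitrary bytes as lowercase hex digits, breaking the line after
// every `width` characters.
std::string to_email_safe(const std::string& bytes, std::size_t width = 72) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    std::size_t column = 0;
    for (unsigned char c : bytes) {
        out += digits[c >> 4];
        out += digits[c & 0x0f];
        column += 2;
        if (column >= width) {
            out += '\n';
            column = 0;
        }
    }
    return out;
}

// Decode, ignoring line breaks and any other character that is not a
// lowercase hex digit.
std::string from_email_safe(const std::string& text) {
    std::string out;
    int high = -1;  // pending high nibble, or -1 if none
    for (char ch : text) {
        int v;
        if (ch >= '0' && ch <= '9') v = ch - '0';
        else if (ch >= 'a' && ch <= 'f') v = ch - 'a' + 10;
        else continue;
        if (high < 0) {
            high = v;
        } else {
            out += static_cast<char>(high * 16 + v);
            high = -1;
        }
    }
    return out;
}
```

A text archive that writes numbers as decimal digits already stays within these limits without the extra pass, which is the sense in which a natively safe format would be neater.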

> * Is it important to allow all UDTs to be separately versioned?

Yes!

My application shares libraries with other teams. There is no one central
place where an archive schema ID can be stored, and updated whenever any
class changes anywhere. Each class needs its own schema ID, so that it can
manage its own evolution in a private, self-contained way.

> Changing the format of a single class always
> creates a backward compatibility problem for new archives anyway.

We need to consider (and document) the format change scenarios which we
can cope with. Adding or removing instance variables is pretty
straightforward. Although it can get messy, if each UDT has its own schema
ID then the mess can be contained within the UDT. I have found this to be
quite manageable.
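
A minimal sketch of what "contained within the UDT" can mean, assuming a simple whitespace-separated text stream: the class writes its own schema ID first and consults it on load. Customer, its fields and the schema numbers are hypothetical.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

class Customer {
public:
    std::string name;      // present since schema 1 (no spaces, for brevity)
    int credit_limit = 0;  // added in schema 2

    static const int schema = 2;  // this class's own schema ID

    void save(std::ostream& os) const {
        os << schema << ' ' << name << ' ' << credit_limit << ' ';
    }

    void load(std::istream& is) {
        int stored = 0;
        is >> stored >> name;
        if (stored >= 2) {
            is >> credit_limit;
        } else {
            credit_limit = 0;  // default for data written before schema 2
        }
    }
};
```

No other class and no archive-wide version number needs to change when Customer gains a field; old data such as "1 alice " still loads.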

Changes to the inheritance hierarchy are harder. I don't think the current
submission can handle this. Currently I use MFC and store my own schema
IDs, and I chain derived classes to base classes manually. This means I
/can/ cope with changes in the inheritance hierarchy. I have found this
facility valuable and wouldn't want to lose it.
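
The manual chaining might be sketched as follows. This uses plain iostreams rather than the MFC API, and Shape, Colored and Circle are hypothetical: Colored is imagined to have been inserted between Shape and Circle at schema 2 of Circle.

```cpp
#include <istream>
#include <ostream>
#include <sstream>

struct Shape {
    int id = 0;
    void save(std::ostream& os) const { os << id << ' '; }
    void load(std::istream& is) { is >> id; }
};

// Introduced later, between Shape and Circle.
struct Colored : Shape {
    int color = 0;
    void save(std::ostream& os) const {
        Shape::save(os);
        os << color << ' ';
    }
    void load(std::istream& is) {
        Shape::load(is);
        is >> color;
    }
};

struct Circle : Colored {
    int radius = 0;
    static const int schema = 2;  // schema 1: Circle derived from Shape

    void save(std::ostream& os) const {
        os << schema << ' ';
        Colored::save(os);  // chain to the current base class
        os << radius << ' ';
    }

    void load(std::istream& is) {
        int stored = 0;
        is >> stored;
        if (stored < 2) {
            Shape::load(is);  // old data has no Colored part
            color = 0;
        } else {
            Colored::load(is);
        }
        is >> radius;
    }
};
```

Because Circle chooses which base to chain to, data written under the old hierarchy ("1 7 3 ") still loads after the new base class is introduced.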

Renaming classes is something which MFC doesn't support. I believe that
some of the proposals which came up during the review would allow this. It
is basically a matter of hooking into the factory method which creates a
new instance from a class name, and substituting an instance of a
different class.
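
A hedged sketch of such a hook, assuming a name-keyed factory with an alias table. Factory, Object, Account and the old name "Customer" are all invented for illustration and do not come from MFC or from any of the review proposals.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Object {
    virtual ~Object() {}
    virtual std::string kind() const = 0;
};

// Imagine this class was called "Customer" when older archives were written.
struct Account : Object {
    std::string kind() const override { return "Account"; }
};

class Factory {
public:
    using Creator = std::function<std::unique_ptr<Object>()>;

    void add(const std::string& name, Creator creator) {
        creators_[name] = std::move(creator);
    }

    // Record that a name found in old archives now means another class.
    void alias(const std::string& old_name, const std::string& new_name) {
        aliases_[old_name] = new_name;
    }

    // Create an instance from the class name stored in the archive,
    // substituting the new class if the stored name has been renamed.
    // Returns a null pointer for an unknown name.
    std::unique_ptr<Object> create(const std::string& stored_name) const {
        auto a = aliases_.find(stored_name);
        const std::string& name =
            (a != aliases_.end()) ? a->second : stored_name;
        auto c = creators_.find(name);
        if (c == creators_.end()) return nullptr;
        return c->second();
    }

private:
    std::map<std::string, Creator> creators_;
    std::map<std::string, std::string> aliases_;
};
```

After f.alias("Customer", "Account"), a call to f.create("Customer") yields an Account, so archives written before the rename remain readable.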

-- Dave Harris

