From: David Abrahams (dave_at_[hidden])
Date: 2002-12-11 18:17:21


brangdon_at_[hidden] (Dave Harris) writes:

>> 1. Agreement on terms. In particular, I strongly suggest beginning
>> with the definitions of serialization and persistence outlined by
>> Augustus Saunders in
>> http://lists.boost.org/MailArchives/boost/msg39598.php. I realize
>> that Robert didn't like those definitions, but they resonated for
>> most people (including me), and seem to provide an excellent
>> starting point.
>
> For what it's worth, I didn't like those definitions. In my view
> serialisation is the right name for what the submitted library did.
> Persistence is just serialisation to a persistent
> medium. Persistence is a property of media rather than formats. It
> all but comes for free once you have decent serialisation.

The term "persistence", as I've heard it used for years, is used to
denote the ability to use "the same" complex data structure across
successive invocations of the same program. Thus, the data
"persists". Serialization to a persistent medium is one way to
implement it, and serialization is almost always a component of the
system. However, there are other approaches. For example, the system
might leave parts of the data structure on disk until such time as
they're accessed.

I still think the other definitions are more useful. In your terms,
"persistence" slices off a tiny fraction of the space of useful
functionality, and everything we care about lies in the domain of
"serialization". I'm willing to use any terms that everyone will
agree to (including yours), but whichever terms we use should be at
least as clearly defined as what Augustus wrote. So far, you haven't
provided a clear definition of serialization.

> But I agree there is a need to separate plain serialisation from
> data conversion/formatting/filtering. With XML, for example, the
> intent is not so much to save and load the data (there are more
> efficient ways of doing that), but to work on the data in its XML
> form with XML tools.

Yes. Or more often, to ensure that such work will be possible should
anyone ever want to do that (premature generalization - can you tell
I'm cynical?)

> It is maybe worth identifying another possible purpose, that of
> reporting or logging, where the serialised output is meant to be
> read by a human. For this it may not be necessary to have input
> archives, just output archives.

That would be serialization without deserialization.

> For the sake of concreteness, I suggest "serialisation",
> "conversion" and "logging".

I think "conversion" is already too overloaded. Either way, I'd like
to have very clear definitions of these.

>> 2. Careful description of scope. Answer questions like:
>> * Is this a persistence or serialization library?
>
> If it can cover all bases reasonably well, that would be nice. I'd
> love it if implementing persistence gave me a readable debugging
> dump for free.

Now I assume you're using your own terminology for persistence.

> However, we need to see whether good support for, say, XML, can be
> done without compromising the simpler requirements. I think we need
> some running code.

We have some; you can use the existing library submission.

> If good support can't be got, then we need to decide whether poor
> support should be included. There may be a danger of leading users
> into a dead end.

Agreed.

> It seems to me there is a natural divide between formats where the
> meaning of a field is determined by its order in the data stream,
> and formats where the meaning of fields is determined by identifying
> tags.
>
> Sometimes when I think about XML, I feel I really want to build a
> /reflection/ library, a way for UDTs to describe their instance
> variables. XML could work by publishing a map of fields: their
> names, types and their byte offsets within their object.

In general the latter is only possible for PODs.
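To make the POD restriction concrete, here is a minimal sketch of such a field map. Everything in it is hypothetical (the Point struct, the IntField descriptor, and the name=value line format standing in for XML tags); it is not taken from the submitted library. It relies on offsetof, which is only well defined for POD (standard-layout) types, which is exactly the limitation noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical POD used only for illustration.
struct Point {
    int x;
    int y;
};

// Hypothetical field descriptor: a tag name plus a byte offset.
struct IntField {
    const char* name;
    std::size_t offset;
};

// One description, shared by the output and input routines below.
static const IntField point_fields[] = {
    { "x", offsetof(Point, x) },
    { "y", offsetof(Point, y) },
};
static const std::size_t point_field_count =
    sizeof(point_fields) / sizeof(point_fields[0]);

// Writes each field as a "name=value" line, in map order.
std::string write_tagged(const Point& p) {
    std::ostringstream out;
    const char* base = reinterpret_cast<const char*>(&p);
    for (std::size_t i = 0; i < point_field_count; ++i) {
        int value;
        std::memcpy(&value, base + point_fields[i].offset, sizeof value);
        out << point_fields[i].name << '=' << value << '\n';
    }
    return out.str();
}

// Reads "name=value" lines in any order; rejects unknown names.
bool read_tagged(const std::string& text, Point& p) {
    std::istringstream in(text);
    char* base = reinterpret_cast<char*>(&p);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) return false;
        std::string name = line.substr(0, eq);
        std::istringstream value_in(line.substr(eq + 1));
        int value;
        if (!(value_in >> value)) return false;
        bool known = false;
        for (std::size_t i = 0; i < point_field_count; ++i) {
            if (name == point_fields[i].name) {
                std::memcpy(base + point_fields[i].offset, &value, sizeof value);
                known = true;
            }
        }
        if (!known) return false;
    }
    return true;
}
```

Because the reader looks fields up by name, input lines may arrive in any order. The same trick does not carry over to a class with virtual functions or non-trivial members, where byte offsets are not a portable description of state.
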

> A single such description could be used by both input and output
> routines. It'd support tagged fields being read in random order -
> what I mean by "good" XML support.

Interesting. Sounds like the "metaclass" facilities others have been
mentioning recently would be useful here.

However, I'm not sure that a good XML description of a data structure
necessarily has a correspondence to its data members. I sure don't
care about the data members in a std::vector<std::string>.
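As a rough illustration of that point, the sketch below archives a std::vector<std::string> by its logical content (an element count followed by length-prefixed strings) and never touches the vector's internal pointers. The save/load names and the text layout are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes the element count, then each string as "<length> <bytes>\n".
void save(std::ostream& out, const std::vector<std::string>& v) {
    out << v.size() << '\n';
    for (std::size_t i = 0; i < v.size(); ++i) {
        out << v[i].size() << ' ' << v[i] << '\n';
    }
}

// Rebuilds the vector from the layout written by save().
// Returns false on malformed or truncated input.
bool load(std::istream& in, std::vector<std::string>& v) {
    std::size_t count;
    if (!(in >> count)) return false;
    v.clear();
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t len;
        if (!(in >> len)) return false;
        if (in.get() != ' ') return false;  // single separator after the length
        std::string s(len, '\0');
        if (len > 0 && !in.read(&s[0], static_cast<std::streamsize>(len))) {
            return false;
        }
        v.push_back(s);
    }
    return true;
}
```

The length prefix lets strings contain spaces or newlines. Nothing in the archived form depends on how the vector happens to lay out its members.
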

> "Poor" XML support would mean providing tag names in the output, but
> more or less ignoring them on input, and using the order of fields
> to determine their meaning.

That's one aspect. I'm sure we can dream up many other ways to define
"poor XML support" ;-)
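For concreteness, the order-dependent behaviour described in the quoted text might look like the sketch below. The Point struct and both function names are invented for illustration, and the parsing is deliberately naive: tag names are written on output but skipped on input.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical type used only for illustration.
struct Point {
    int x;
    int y;
};

// Output includes tag names.
std::string write_xml(const Point& p) {
    std::ostringstream out;
    out << "<x>" << p.x << "</x><y>" << p.y << "</y>";
    return out.str();
}

// Reads the integer content of the next element, ignoring its tag name,
// and advances pos past the closing tag.
bool next_value(const std::string& text, std::string::size_type& pos, int& value) {
    std::string::size_type open_end = text.find('>', pos);
    if (open_end == std::string::npos) return false;
    std::string::size_type close_begin = text.find('<', open_end + 1);
    if (close_begin == std::string::npos) return false;
    std::string::size_type close_end = text.find('>', close_begin);
    if (close_end == std::string::npos) return false;
    std::istringstream in(text.substr(open_end + 1, close_begin - open_end - 1));
    if (!(in >> value)) return false;
    pos = close_end + 1;
    return true;
}

// Input assigns values purely by position: first element to x, second to y.
bool read_by_order(const std::string& text, Point& p) {
    std::string::size_type pos = 0;
    return next_value(text, pos, p.x) && next_value(text, pos, p.y);
}
```

Feeding this reader a document whose elements have been reordered by some XML tool silently swaps the fields, which is the kind of failure the tags were supposed to prevent.
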

> I suspect that even poor support would be worth-while. The
> "dead-end" fear is a chimera - people can start with limited XML,
> and then expand to full XML later and still be able to read their
> data. Also, I think that providing tag information will enable a
> wealth of new applications.
>
>> * Is it important to be able to plug in arbitrary archive
>> formats?
>
> The fewer restrictions on archive format the better. But that is just a
> platitude. I think we need running code to see what can and can't be done
> in practice.

I think that's premature. It's easy enough to write code with a
design so limited that it incorrectly "proves" various things can't be
done.

>> * What kinds of applications are we intending to serve?
>> * What kinds of applications are we explicitly NOT intending to
>> serve?
>
> Do we need a list?

Yes.

> Here's a starting point:
>
> Serialisation:
> o Deep copy by writing to a RAM buffer and then reading again
> within the same address space.
>
> o Sending data structures down pipes, through serial ports,
> in TCP/IP packets, and through OLE/COM/XXX links.
>
> o Sending data structures through email, HTTP "POST" requests,
> and similar dumb channels.

Hmm. What needs to be present on the receiving end in order to do
something useful with that data?

<snip>

>> [...] a variable-length format. This [...] entirely obviates
>> the need for a text archive format
>
> Actually a text format could still be useful if it was "email
> safe".

Sure. I was trying to say that within the scope of applications that
the library seemed to be attacking, everything could be done with a
single format.

> Some gateways strip out certain control characters, will not pass
> characters with the high bit set, have restrictions on line-length
> and total message size, etc. We could send a binary format through a
> "uuencode" filter, but a text format which was natively safe would
> be neater (and probably more efficient).

Why would it be more efficient?

> Eg people may want to embed serialised data into web pages.

>> * Is it important to allow all UDTs to be separately versioned?
>
> Yes!
>
> My application shares libraries with other teams. There is no one central
> place where an archive schema ID can be stored, and updated whenever any
> class changes anywhere. Each class needs its own schema ID, so that it can
> manage its own evolution in a private, self-contained way.

OK, I'm convinced.
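One way to picture the per-class scheme: each type writes its own version number ahead of its state and interprets older versions itself, so no central schema ID is needed. The Address type, its fields, and the save/load functions below are hypothetical and only sketch the idea; they do not reflect the submitted library's interface.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical class with its own private version history.
struct Address {
    std::string city;
    std::string country;  // added in version 2
};

// Version written by the current code; owned by this class alone.
const int address_version = 2;

void save(std::ostream& out, const Address& a) {
    out << address_version << '\n' << a.city << '\n' << a.country << '\n';
}

// Accepts version 1 and version 2 records; rejects anything else.
bool load(std::istream& in, Address& a) {
    int version;
    if (!(in >> version)) return false;
    in.ignore();  // skip the newline after the version number
    if (version < 1 || version > address_version) return false;
    if (!std::getline(in, a.city)) return false;
    if (version >= 2) {
        if (!std::getline(in, a.country)) return false;
    } else {
        a.country = "unknown";  // default for records written before version 2
    }
    return true;
}
```

A change to Address touches only these two functions. Other classes stored in the same archive keep their own numbers and are unaffected.
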

>> Changing the format of a single class always creates a
>> backward compatibility problem for new archives anyway.
>
> We need to consider (and document) the format change scenarios which we
> can cope with.

Good plan.

> Adding or removing instance variables is pretty
> straightforward.

Erm. I am still leery of thinking of all this in terms of "instance
variables". The representation of state written to the archive may or
may not have a direct correspondence to a class' data members.

> Although it can get messy, if each UDT has its own schema ID then
> the mess can be contained within the UDT.

"schema ID"? Can you give an example of "containing the mess within the UDT?"

> Changes to the inheritance hierarchy are harder. I don't think the
> current submission can handle this. Currently I use MFC and store my
> own schema IDs, and I chain derived classes to base classes
> manually. This means I /can/ cope with changes in the inheritance
> hierarchy. I have found this facility valuable and wouldn't want to
> lose it.

It's beginning to sound more and more like the metaclass framework
some people have been hinting at.

> Renaming classes is something which MFC doesn't support. I believe
> that some of the proposals which came up during the review would
> allow this.

Why should a class name come into play, unless you were using
std::type_info to archive it?

> It is basically a matter of hooking into the factory method which
> creates a new instance from a class name, and substituting an
> instance of a different class.

... assuming there is such a factory method. It sounds like your
viewpoint on this is very heavily influenced by one particular kind of
application.
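For readers unfamiliar with the mechanism under discussion, a name-keyed factory with a rename hook could be sketched as follows. The Shape hierarchy, the Factory class, and the "Oval" to "Ellipse" rename are all made up for this example, and it presumes a design in which class names are stored in the archive, which is precisely the assumption being questioned.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical class hierarchy.
struct Shape {
    virtual ~Shape() {}
    virtual std::string kind() const = 0;
};

struct Ellipse : Shape {
    std::string kind() const { return "Ellipse"; }
};

typedef std::unique_ptr<Shape> (*Creator)();

std::unique_ptr<Shape> make_ellipse() {
    return std::unique_ptr<Shape>(new Ellipse);
}

// Maps archived class names to creation functions.
class Factory {
public:
    void add(const std::string& name, Creator creator) {
        creators_[name] = creator;
    }

    // Returns a null pointer when the name is not registered.
    std::unique_ptr<Shape> create(const std::string& name) const {
        std::map<std::string, Creator>::const_iterator it = creators_.find(name);
        if (it == creators_.end()) return std::unique_ptr<Shape>();
        return it->second();
    }

private:
    std::map<std::string, Creator> creators_;
};

// "Oval" is the hypothetical old name of Ellipse; registering both lets
// archives written before the rename still load.
Factory make_factory() {
    Factory f;
    f.add("Ellipse", make_ellipse);
    f.add("Oval", make_ellipse);
    return f;
}
```

An archive design that identifies types by some other key never needs this table, so whether renaming is a problem at all depends on that earlier choice.
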

-- 
                       David Abrahams
   dave_at_[hidden] * http://www.boost-consulting.com
Boost support, enhancements, training, and commercial distribution
