From: Dave Harris (brangdon_at_[hidden])
Date: 2002-12-12 19:21:37


In-Reply-To: <u4r9kjgce.fsf_at_[hidden]>
On Wed, 11 Dec 2002 18:17:21 -0500 David Abrahams
(dave_at_[hidden]) wrote:
> I'm willing to use any terms that everyone will agree to
> (including yours)

Me too.

> but whichever terms we use should be at least as clearly defined
> as what Augustus wrote.

I'm afraid I couldn't quite get my head around them. To me, "persistence"
and "serialisation" are at different levels of abstraction. Serialisation
is one way to implement persistence. As such they do not compete; they are
not mutually incompatible alternatives.

I think we have a consensus that a fully general persistence library, that
could be implemented by dumping RAM images to disk or whatever, is not
what we want at this point. I'm OK with that. What I don't understand is
what Augustus means when he says:

     I think that plain serialization (your term) should be
     explicitly *not supported* and defer that use case to a
     safer, more airtight approach with a persistance library.

What is gained by excluding persistence, and/or the simpler kinds of
serialisation (where source and destination are the same program running
on the same hardware with the same compiler)?

> So far, you haven't provided a clear definition of serialization.

Actually I agree with Augustus's, as far as I understand it, which isn't
far. He seems to imply that serialisation does not need to bother with
object factories or object lifetime management. I don't understand how
that can be. I can't figure out whether UDT versioning belongs to
Persistence or to Serialisation. He says Persistence, but doesn't that
make Persistence asymmetrical and involve it in non-trivial transforms?
How can it be achieved by transparent meta-programming magic? I doubt a
robust but transparent persistence mechanism can be built.

> > We could send a binary format through a "uuencode" filter, but a
> > text format which was natively safe would be neater (and probably
> > more efficient).
>
> Why would it be more efficient?

Because it has more knowledge.

For example, if we write out the number 500 using an alphabet of 64 safe
characters, it takes 2 characters. If we write it out using all 256
characters, it still takes 2 of them, but now to make it safe each
character needs 2 safe characters to represent it, so it takes 4 bytes
altogether. The double conversion is more verbose because the first part
loses information.
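
The arithmetic above can be checked with a rough sketch. It follows the simplified per-character model used in this paragraph (each raw byte expanded separately into safe characters), not any real filter; actual uuencode packs 3 bytes into 4 characters. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of digits needed to write `value` in the given base (at least one).
std::size_t digits_needed(unsigned long value, unsigned long base) {
    assert(base >= 2);
    std::size_t digits = 1;
    while (value >= base) {
        value /= base;
        ++digits;
    }
    return digits;
}

// Direct route: encode straight into a 64-character safe alphabet.
std::size_t direct_safe_length(unsigned long value) {
    return digits_needed(value, 64);
}

// Two-step route: write raw bytes (base 256), then expand each byte on its
// own into safe characters. One byte (0..255) needs 2 characters in base 64.
std::size_t filtered_safe_length(unsigned long value) {
    return digits_needed(value, 256) * digits_needed(255, 64);
}
```

For 500, direct_safe_length() gives 2 and filtered_safe_length() gives 4, matching the figures above. The second route pays for rounding up to whole bytes before the safe encoding ever sees the number.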

> > Adding or removing instance variables is pretty straightforward.
>
> Erm. I am still leery of thinking of all this in terms of "instance
> variables". The representation of state written to the archive may or
> may not have a direct correspondence to a class' data members.

Sure. Call them "fields" if it helps. I sometimes find it helpful to think
in terms of concrete examples.

The point is, sometimes a class grows so that its serialised
representation gets bigger.

> "schema ID"?

A term from MFC. It is what the submitted library calls a file_version.

> Can you give an example of "containing the mess within the UDT?"

I don't have a good example to hand. Here's a made up one:

    void MyClass::load( CArchive &ar ) {
        int schema = load_schema( ar, 10, 15 );
        
        if (schema >= 13)
            MyBaseClass::load( ar );
        else {
            MyOldBaseClass::load( ar );
            int myBaseClassData;
            ar >> myBaseClassData;
            MyBaseClass::init( myBaseClassData );
        }
        
        if (schema >= 14)
            ar >> myVar1;
        else
            myVar1 = 100;
            
        if (schema == 14) {
            int unused;
            ar >> unused;
        }
            
        if (schema >= 13)
            ar >> myVar2;
        else {
            MyOldType t;
            ar >> t;
            myVar2 = convert( t );
        }
    }

The first line fetches the class's schema/version number. The arguments to
load_schema() are used for range-checking - load_schema() may throw. For
safety, it's best not to use schemas 0 or 1 so I usually start from 10.
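
For concreteness, here is a minimal sketch of what a range-checking load_schema() might look like. It is an assumption for illustration only: it reads the number as text from a std::istream standing in for the archive, whereas the example above takes a CArchive. The helper load_schema_from_text() is likewise made up, just to make the function easy to try.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in: reads a schema number from a text stream and
// rejects anything outside [min_schema, max_schema].
int load_schema(std::istream& ar, int min_schema, int max_schema) {
    assert(min_schema <= max_schema);
    int schema = 0;
    if (!(ar >> schema))
        throw std::runtime_error("archive: could not read schema number");
    if (schema < min_schema || schema > max_schema)
        throw std::out_of_range("archive: unsupported schema number");
    return schema;
}

// Convenience wrapper for trying the function on a literal string.
int load_schema_from_text(const std::string& text, int min_schema, int max_schema) {
    std::istringstream ar(text);
    return load_schema(ar, min_schema, max_schema);
}
```

With the range 10 to 15, an archive holding 13 loads normally, while one holding 16 (written by a newer program) or 1 (probably corrupt) throws before any member data is read.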

The next block chains explicitly to the base class. In this case older
archives used a different base class so we have some nasty code to make it
work.

The next few lines load a variable. Old archives didn't store it, so we
have to provide a default value.

Schema 14 added an int which was later removed; if it is present we have
to skip over it.

The last few lines load another variable. Older archives used a different
type so we may need to load a temporary object of that old type and then
convert it.

I don't know what you think of this code - whether it horrifies you for
being too low level or lacking in design foresight. It is my practical
experience. Designs age, and the history accretes in the serialisation
load routines. I hope that the boost library will be able to support this
kind of evolution. I don't claim that code like this is the best solution,
but in practice I have found it works.

> It's beginning to sound more and more like the metaclass framework
> some people have been hinting at.

Do you mean that some framework could handle a history like that reflected
in the above code, automatically? How would that work? How could it track
changes to the base class over time?

Java manages it by storing a snapshot of the class hierarchy (as it was
when the archive was made) into the archive. That gives it enough
information to figure out how the hierarchy has changed. However, it can
lead to rather bloated archives.

> > Renaming classes is something which MFC doesn't support. I believe
> > that some of the proposals which came up during the review would
> > allow this.
>
> Why should a class name come into play, unless you were using
> std::type_info to archive it?

MFC uses its own macro-driven RTTI system, in which classes are identified
by name. If we don't use that, or type_info, then there is probably no
problem. We do need to make sure that we can add new classes without
somehow breaking the correspondence between the old classes and whatever
the archive stores to represent them.

> ... assuming there is such a factory method.

The archive has to store something to represent classes, and has to be
able to create instances of the classes so represented, in order to
restore polymorphic pointers. That's what I mean by a "factory method". I
don't mean to imply a particular implementation.
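
One possible shape, purely as a sketch and not a proposal for the library: a table mapping whatever id the archive stores to a function that creates an instance. Shape, Circle, Square and Registry are made-up names, and the string id is only one choice of key.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct Shape {
    virtual ~Shape() = default;
    virtual std::string kind() const = 0;
};
struct Circle : Shape { std::string kind() const override { return "circle"; } };
struct Square : Shape { std::string kind() const override { return "square"; } };

// Maps the id stored in the archive to a creator for the matching class.
class Registry {
public:
    using Creator = std::function<std::unique_ptr<Shape>()>;

    void add(const std::string& id, Creator creator) {
        assert(creator);  // an empty creator could never build an object
        creators_[id] = std::move(creator);
    }

    // Called while loading a polymorphic pointer: id comes from the archive.
    std::unique_ptr<Shape> create(const std::string& id) const {
        auto it = creators_.find(id);
        if (it == creators_.end())
            throw std::runtime_error("unknown class id in archive: " + id);
        return it->second();
    }

private:
    std::map<std::string, Creator> creators_;
};
```

Because the id is only a key, adding a new class is just another add() call and does not disturb existing entries. A renamed class could presumably stay loadable by registering its creator under the old id as well as the new one.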

> It sounds like your viewpoint on this is very heavily influenced by
> one particular kind of application.

Yes. Well, less so than my choice of words may have implied. And of course
in that passage I was discussing a trap that MFC fell into. Your earlier
comment:

     [...] the use of type_info::name() for type identification. Even
     if these were optional components to the library, they could
     provide enormous benefit for some applications.

made it sound like you might make the same mistake. If we use class names
to identify types, we need to make sure we can rename classes and still
load old files.

But generally, yes, I know what kind of applications I write and hope
Boost will support them. If other people have different expectations,
shouldn't they write about them? Isn't that what this pre-coding
discussion is for?

I'm sorry for the length of this post, but now that I've written it, maybe
you can tell me whether I want a persistency library or a serialisation
library.

-- Dave Harris

