From: Stefan Seefeld (seefeld_at_[hidden])
Date: 2007-07-13 21:01:32


Sebastian Redl wrote:

> The proposed API shares these problems with the DOM:
> 1) Very verbose.
> 2) Indirect node construction. I can't create an element by
> instantiating the class element - it has a protected constructor.
> Instances are created through some sort of factory, typically by calling
> methods of document and element that create children and return them.
> This is not a very natural syntax.

The reason to delegate to a factory is to let it do a lot of resource
management that thus can be hidden from the user. There is a lot to be
considered, as each node lives in a particular context given by the document
as well as its position in it (think of namespaces, for example).

It may of course be possible to hide that by providing stack variables
that are merely proxies, so the actual instantiation will be done lazily,
once the (proxy) node is inserted into the document. I haven't thought too
hard about that, since to me using a factory is a natural means to allow
encapsulation.
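
To make the factory argument concrete, here is a minimal sketch. The classes `document` and `element` and the methods `create_root` and `append_child` are hypothetical stand-ins, not the proposed API. The point is only that a non-public constructor lets the document own every node and hand each one its context at creation time.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the proposed library's API.
class document;

class element {
public:
    const std::string& name() const { return name_; }
    const document& owner() const { return *owner_; }
    std::size_t child_count() const { return children_.size(); }

    // Children are created through their parent, so they inherit its
    // owning document (and, in a real library, its namespace scope).
    element& append_child(const std::string& name) {
        children_.push_back(
            std::unique_ptr<element>(new element(*owner_, name)));
        return *children_.back();
    }

private:
    friend class document;

    // Not publicly constructible: users cannot create a detached element.
    element(document& owner, std::string name)
        : owner_(&owner), name_(std::move(name)) {}

    document* owner_;
    std::string name_;
    std::vector<std::unique_ptr<element>> children_;
};

class document {
public:
    // The document is the factory for its root element and owns the tree.
    element& create_root(const std::string& name) {
        root_.reset(new element(*this, name));
        return *root_;
    }

private:
    std::unique_ptr<element> root_;
};
```

With this shape the user never allocates or frees a node; everything is released when the document is destroyed.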

> The API has some additional disadvantages:
> 1) No real namespace support. To create an element in a given namespace,
> I have to register a prefix and then use the prefix in the element name.
> Worse, to find out the namespace of an element, I have to parse the
> string for the prefix (it's one find and one substr operation, but still)
> and then look it up to find the full namespace URI. (Depending on the
> semantics of element::lookup_namespace, I might have to walk the tree
> for that.) Given that some documents, especially generated ones,
> sometimes have multiple bindings for the same namespace, this is overly
> tedious. Namespace URI and local name of an element should be
> first-class properties. The prefix:local convention is really just a
> hack - in the Infoset view of the information, it doesn't even exist.
> (The prefix does, the combined name doesn't. See 2.2 of the XML Infoset
> spec.)

OK, I agree. This can be addressed independently from all the rest, however.
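
For reference, the "one find and one substr" lookup described in the quote could look roughly like this. It is a sketch under assumptions: `expanded_name`, `split_qname` and `resolve` are hypothetical names, and the in-scope prefix bindings are modelled as a plain `std::map` rather than a walk up the tree.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical type: namespace URI and local name as first-class properties.
struct expanded_name {
    std::string namespace_uri;
    std::string local_name;
};

// Split "prefix:local" into (prefix, local).
// An unprefixed name yields an empty prefix.
inline std::pair<std::string, std::string>
split_qname(const std::string& qname) {
    const std::string::size_type colon = qname.find(':');
    if (colon == std::string::npos)
        return std::make_pair(std::string(), qname);
    return std::make_pair(qname.substr(0, colon), qname.substr(colon + 1));
}

// Resolve a qualified name against a set of prefix bindings.
// An unbound prefix resolves to an empty namespace URI in this sketch.
inline expanded_name
resolve(const std::string& qname,
        const std::map<std::string, std::string>& bindings) {
    const std::pair<std::string, std::string> parts = split_qname(qname);
    const std::map<std::string, std::string>::const_iterator it =
        bindings.find(parts.first);
    expanded_name result;
    result.namespace_uri = (it == bindings.end()) ? std::string() : it->second;
    result.local_name = parts.second;
    return result;
}
```

If elements stored the namespace URI and local name directly, user code would need neither function.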

> 2) Not an existing standard. Whatever else you can say for the DOM, it
> is well-known. Whatever your language - Java, C#, PHP, JavaScript, C++
> with Xerces - the DOM is, minor variations in capitalization aside, a
> constant. By providing essentially the same functionality, but through a
> slightly different interface, you lose the recognition value of the DOM
> without gaining much. (You're avoiding the full complexity of the DOM,
> which is a good thing.)

Sorry, that argument I don't accept. Yes, I deliberately chose not to
use the API as obtained from using the CORBA C++ bindings of the OMG IDL DOM.
The hope is to get something better, much more naturally tied to modern
C++ idioms. Whether or not I achieve that is to be discussed, and can be
criticized, but the lack of conformance to existing DOM APIs in itself
is hardly an argument worth debating.

> 3) Not as extensive. I'm not talking about the annoying multiple
> redundancy of the DOM here, but of low-level functionality such as
> preserving entity references in the node tree. For some low-level tasks,
> this is important stuff. Not that the DOM is really extensive: it
> provides no way, for example, to modify the document schema. (It allows
> introspection, at least.)

OK, the API represents the Infoset, and thus has no idea of what an entity
is. I'm not sure whether that would be worth adding. If it is, it could be
a hook in the XML writer (the XML parser already has one).

I don't understand what you are aiming at in your comment about the
'document schema'.

> The API has one clear advantage over the DOM: the use of iterators. The
> DOM also has many shortcomings that the API, due to its restricted scope,
> doesn't have. All in all, though, I think the chosen balance between
> closeness to the DOM and doing something different and interesting is
> not good.
>
> There are some things I simply consider mistakes:
> 1) cdata should derive from text. It's basically a special case that
> only differs in its serialization from the general form.

That's an implementation detail (IMO). Semantically, a text node and
a cdata node are distinct, and so visitors shouldn't give users access
to a cdata node as a text node. (And what else would the ISA relationship
be good for?)
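
A small sketch of what I mean, with hypothetical class names (`visitor`, `text`, `cdata`, `xml_writer`) that are not taken from the proposed API. Because `cdata` does not derive from `text`, a visitor has to handle the two cases separately and cannot receive a cdata node through the text overload by accident.

```cpp
#include <string>
#include <utility>

class text;
class cdata;

// Visitor with one overload per distinct node type.
class visitor {
public:
    virtual ~visitor() {}
    virtual void visit(const text&) = 0;
    virtual void visit(const cdata&) = 0;
};

class node {
public:
    virtual ~node() {}
    virtual void accept(visitor& v) const = 0;
};

class text : public node {
public:
    explicit text(std::string s) : content_(std::move(s)) {}
    const std::string& content() const { return content_; }
    void accept(visitor& v) const override { v.visit(*this); }
private:
    std::string content_;
};

// Deliberately NOT derived from text: the two are siblings.
class cdata : public node {
public:
    explicit cdata(std::string s) : content_(std::move(s)) {}
    const std::string& content() const { return content_; }
    void accept(visitor& v) const override { v.visit(*this); }
private:
    std::string content_;
};

// Escape the two characters that must not appear raw in character data.
inline std::string escape(const std::string& s) {
    std::string out;
    for (char ch : s) {
        if (ch == '&') out += "&amp;";
        else if (ch == '<') out += "&lt;";
        else out += ch;
    }
    return out;
}

// Example visitor: the two node types serialize differently.
class xml_writer : public visitor {
public:
    void visit(const text& t) override { out += escape(t.content()); }
    void visit(const cdata& c) override {
        out += "<![CDATA[" + c.content() + "]]>";
    }
    std::string out;
};
```

Whether the two classes share implementation behind the scenes is a separate question from what the public hierarchy exposes.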

> 2) You have a class dtd, but to access it you use
> document::internal_subset. This dtd class doesn't provide access to the
> internal subset however - only the document type declaration, after
> which it is named. (Yes, the document type /definition/ has the same
> abbreviation. Very unfortunate, that.)

I'm sure this can be refined. (In fact, I don't think DTDs will play any
significant role in the future, as other schema languages, such as
RELAX NG, become more popular.)

> Some more issues:
> 1) The whole node/node_ptr mess. From reading your earlier posts, I
> thought that node and friends were value-like classes, that they
> directly represent the nodes, whereas node_ptr was a special smart
> pointer that provided the memory management and the shallow copying
> semantics. Only, upon reading the code, I find that node_ptr contains an
> instance of its element type, not a pointer to it, which means that node
> and derived are the smart pointers with shallow copying and memory
> management. Except that they don't: the pointers are never freed until
> the entire owner document is destroyed. Or the node is explicitly
> removed from its parent. Oh, and document is an exception to this
> convention, because it actually is a value-style class.
> That's not to say that this isn't a sensible overall strategy. It just
> is extremely confusing given your naming conventions. As far as I can
> see, the only thing node_ptr actually does is make access less
> convenient by requiring indirection - and thus double indirection for
> node iterators. (*i)->foo sucks, sorry. Apropos, why isn't node_iterator
> written using the Boost.Iterator library?

OK, I understand that I need to rethink how to represent things. To me
it is clear, however, what I want: encapsulate nodes and their management
such that the user doesn't have to care for allocation / deallocation,
but instead accesses (dereferences) nodes via node_ptr proxies.
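
Roughly the separation I have in mind, as a sketch only. `element_impl` and `owning_document` are hypothetical names, and this `node_ptr` is a simplified illustration rather than the class in the current code: the document owns the nodes, and `node_ptr` is a cheap, copyable, non-owning handle that is dereferenced to reach them.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type; the real library wraps libxml2 nodes instead.
class element_impl {
public:
    explicit element_impl(std::string n) : name_(std::move(n)) {}
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// Non-owning proxy: copying it is shallow, destroying it frees nothing.
template <typename T>
class node_ptr {
public:
    explicit node_ptr(T* p = nullptr) : p_(p) {}
    T& operator*() const { return *p_; }
    T* operator->() const { return p_; }
    explicit operator bool() const { return p_ != nullptr; }
private:
    T* p_;  // owned by the document, not by this handle
};

// The document owns every node and releases them in its destructor.
class owning_document {
public:
    node_ptr<element_impl> create_element(const std::string& name) {
        nodes_.push_back(
            std::unique_ptr<element_impl>(new element_impl(name)));
        return node_ptr<element_impl>(nodes_.back().get());
    }
private:
    std::vector<std::unique_ptr<element_impl>> nodes_;
};
```

A handle obtained this way is only valid while its document is alive, which is the trade-off of leaving all deallocation to the document.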

> 2) write_to_file as a member of document. This is asymmetric to
> parse_file being a free factory function. It's also unnecessary and, in
> my opinion, not a good idea for various reasons. One is the
> aforementioned asymmetry. Another is the public interface size, as
> mentioned in one of the Effective C++ books: write_to_file doesn't
> actually need access to document's internals, because it just serializes
> the node tree, right? (That it needs access to the contained libxml
> pointer is a detail that shouldn't affect the interface. Make it a
> friend if you have to, but implementations should have the option of not
> making it one.)

That's a good point. I will make write_to_file a free-standing function.

> There's more inconsistency here. The Efficient XML Interchange working
> group is overdue again with their first working draft, but a binary,
> compact serialization for XML _is_ in the works. Once they publish a
> recommendation, there will be two official serializations of the XML
> Infoset. And there are several unofficial ones already. Each one of
> these needs a pair of parse/serialize functions. (Not necessarily
> provided by the library, of course.) With write_to_file being a member,
> it enjoys a unique status that it doesn't really deserve. It also enjoys
> a very ambiguous name, as does parse_file.
> Even leaving that aside, there's also the option of multiple
> parse/serialize pairs just for a single format. They could take
> alternative input sources: a boost::path instead of a std::string for
> identification, for example. Or a std::istream as a data
> source/std::ostream as a data sink. Or a boost::url, when such a class
> is written, together with a pluggable communication framework for
> transparently fetching network URLs. Or whatever.
> Point is, all these are simple extensions to the system, but there is
> inconsistency if one function is a member when no other can be.

Right.
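
One possible shape for this, sketched with hypothetical names (`write`, `write_file`, and a toy `document` that only stores a root element name; none of these are the proposed API). The stream overload is the core, and any other sink is a thin free function layered on top, so no serializer has special status as a member.

```cpp
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Toy stand-in for the library's document class.
class document {
public:
    explicit document(std::string root_name)
        : root_name_(std::move(root_name)) {}
    const std::string& root_name() const { return root_name_; }
private:
    std::string root_name_;
};

// Core serializer: a free function that uses only the public interface
// and writes to any std::ostream.
inline void write(const document& doc, std::ostream& out) {
    out << "<?xml version=\"1.0\"?>\n<" << doc.root_name() << "/>\n";
}

// Convenience overload for files, built on the stream version.
// Returns false if the file cannot be opened or the write fails.
inline bool write_file(const document& doc, const std::string& filename) {
    std::ofstream out(filename.c_str());
    if (!out) return false;
    write(doc, out);
    return static_cast<bool>(out);
}

// Serializing to a string is one more free function, no member needed.
inline std::string to_string(const document& doc) {
    std::ostringstream out;
    write(doc, out);
    return out.str();
}
```

A binary serialization, or one taking a different kind of sink, would be added the same way without touching `document`.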

> Sorry for not being very constructive. I'll take a look at the reader
> next time I find some time, and make more general comments the time
> after that. I hope to get there within a week.

Thanks for your comments. I will try to address them, if only by working
on documentation that gives a rationale for the various choices I have made.

Regards,
                Stefan

-- 
      ...ich hab' noch einen Koffer in Berlin...
