Boost logo

Boost :

From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-07-12 06:58:50


Mathias Gaunard wrote:
> Stefan Seefeld wrote:
>
>> I would appreciate if anybody interested into a future boost.xml
>> submission would have a look, provide feedback, or even get
>> involved into the (ongoing) development.
>
> I have to admit I am a bit confused by the interface.
>
> For example, the first thing I see is that all nodes are actually
> pointers.

What's the best solution to this, and the memory management issues in general?

It's difficult, because a node could be a simple text node with just
one character in it, or it could be the root of a multi-megabyte
document. Clearly in the first case it's OK to copy the data, while in
the second case you would want to avoid deep copying in some way, e.g.
with smart pointers and/or some sort of copy-on-write system.
(Thinking aloud) I can imagine a tree where the top levels have
different types from the bottom ones.

I have just dug out an old email that I wrote to the xmlwrapp list in
2004. I had observed that xmlwrapp, which is a C++ wrapper around
libxml2 with much in common with Stefan's proposal, had an overhead of
about 150 bytes per node. I had been getting binary data from a
database (with negligible space overhead compared to a C++ struct),
converting it to in-memory XML and then converting that to HTML or SVG
using XSLT. The in-memory XML was terribly bloated to the extent that
it was the bottleneck in the system, for a couple of reasons. Firstly,
as noted above, the data structures had a significant overhead (150
bytes for the XML node compares to a 30 byte per element overhead in a
std::list<std::string>). And secondly there's the overhead for
self-description i.e. all the strings making up the tag and attribute
names. This second issue could be helped if a symbol-table approach
were used (e.g. the Boost.Spirit symbols class; each n-byte string
would be reduced to hopefully one pointer).

The view I'm coming to is that I'd prefer an XML class with enough
template parameters to make these sorts of issues resolvable by the
user, depending on the requirements of their application:

struct node {};

template <typename STR_T = std::string>
struct textnode: public node, public STR_T { .... };

template <typename STR_T = std::string>
struct comment: public node, public STR_T { .... };

template <typename TAG_NAME_T = std::string,
           typename ATTR_NAME_T = std::string,
           typename ATTR_VAL_T = std::string,
           typename CHILD_T = boost::shared_ptr<node> >
struct element: public node {
   TAG_NAME_T name;
   std::map<ATTR_NAME_T,ATTR_VAL_T> attributes;
   std::list<CHILD_T> children;
};

You would then need parser and writer functions that would work with
any suitable string and child-pointer types. I don't know whether the
CHILD_T parameter can be used to support both pointers-to-children and
inline-children, nor whether a copy_on_write_ptr<T> could somehow fit
into this. Any thoughts?

I'm also interested to hear what people think about XPath. I have not
used it much outside of XSLT. My feeling is that it is easy to write
"//foo" to find all the <foo> elements, but that this will descend all
the way to the leaves of your tree even if you know that <foo> elements
can only appear in certain contexts near the root. To do efficient
searches you either need a set of index structures (like XLST's keys,
or Boost.MultiIndex), or a kind of tree-traversal-iterator that lets
you be a bit more selective about when to descend into a subtree and
when to skip it.

Regards,

Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk