From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2006-09-07 16:16:28


On Thu, September 7, 2006 11:04 am, loufoque wrote:
>
> It could also be possible to make the push and pull parsers more or less
> independent, so that each one can be as efficient as it can be.

True. Basically, by making a direct push parser implementation, you can avoid
the overhead of state saving that a pull parser requires.
However, that effectively means duplicating the work, so the library should be
written in such a way that the client can easily replace the push parser that
is implemented on top of a pull parser with a direct push parser.
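
To make the substitution idea concrete, here is a minimal sketch. All names
(PullParser, PushOverPull, collect_element_names) are hypothetical, and the
toy pull parser just replays a prepared event list instead of reading XML:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event model; not taken from any existing library.
enum class EventKind { StartElement, EndElement, Text, EndOfDocument };

struct Event {
    EventKind kind;
    std::string value;
};

// Toy pull parser: replays a prepared event list. A real one would keep
// tokenizer state between calls to next(), which is the overhead at issue.
class PullParser {
public:
    explicit PullParser(std::vector<Event> events) : events_(std::move(events)) {}

    Event next() {
        assert(pos_ <= events_.size());
        if (pos_ == events_.size()) return {EventKind::EndOfDocument, ""};
        return events_[pos_++];
    }

private:
    std::vector<Event> events_;
    std::size_t pos_ = 0;
};

// Push front end layered on a pull parser: pull each event and hand it to
// the handler, which is any callable taking (const Event&).
template <typename Pull>
struct PushOverPull {
    Pull pull;

    template <typename Handler>
    void parse(Handler&& handler) {
        for (;;) {
            Event e = pull.next();
            if (e.kind == EventKind::EndOfDocument) break;
            handler(e);
        }
    }
};

// Client code relies only on "has parse(handler)", so a direct push parser
// offering the same member function could be dropped in instead.
template <typename PushParser>
std::vector<std::string> collect_element_names(PushParser& parser) {
    std::vector<std::string> names;
    parser.parse([&names](const Event& e) {
        if (e.kind == EventKind::StartElement) names.push_back(e.value);
    });
    return names;
}
```

Because collect_element_names is a template over the parser type, switching
to a direct push parser would be a one-line change at the point where the
parser object is created.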

>> - Since it needs to access resources from various sources, typically
>>   specified as URLs, it needs a flexible and runtime-switchable input
>>   system.
>> - In particular, it should be possible to plug scheme resolvers in at
>>   runtime, so that program extensions can provide support for, say, the
>>   ftp: scheme.
>
> That would be the work of another library, one that would provide a way to
> read any kind of resource from a URL, a bit like what PHP has. That kind
> of library would also be very useful outside of the XML library.

True, and I have no intention of providing such a library within the XML
library. But it is something that needs to be kept in mind when thinking about
I/O.
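
Just to illustrate what "pluggable at runtime" could mean for the I/O side,
here is a rough sketch of a resolver registry. ResolverRegistry and its
members are made-up names, not a proposed interface, and a resolver simply
returns the whole resource as a string:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical registry mapping a URL scheme ("ftp", "http", ...) to the
// code that knows how to fetch resources for it.
class ResolverRegistry {
public:
    // A resolver receives the full URL and returns the resource contents.
    using Resolver = std::function<std::string(const std::string& url)>;

    // Registers (or replaces) the resolver for a scheme at runtime.
    void add(const std::string& scheme, Resolver resolver) {
        assert(static_cast<bool>(resolver));
        resolvers_[scheme] = std::move(resolver);
    }

    bool supports(const std::string& scheme) const {
        return resolvers_.count(scheme) != 0;
    }

    // Dispatches on the text before the first ':' in the URL.
    std::string fetch(const std::string& url) const {
        const std::size_t colon = url.find(':');
        if (colon == std::string::npos)
            throw std::invalid_argument("URL has no scheme: " + url);
        const auto it = resolvers_.find(url.substr(0, colon));
        if (it == resolvers_.end())
            throw std::runtime_error("no resolver for: " + url);
        return it->second(url);
    }

private:
    std::map<std::string, Resolver> resolvers_;
};
```

A program extension would call add("ftp", ...) when it is loaded; the parser
would only ever see fetch().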

> Maybe a more low-level approach like what Boost.Asio provides could be
> interesting, especially since this model also provides asynchronous I/O.

I'm not sure how useful async I/O is for a parser. Parsing of incomplete data
seems more important. If you have that and you want asynchronous parsing
events, you can start the async I/O and have a handler that parses the newly
received data, posting events to the async queue. The XML library could provide
such a handler, but that would be a very independent feature.
The problem with the really low-level approach is the one you mention next.

> Since XML needs good Unicode support and the like, maybe there is work
> to be done in that area first in boost.

Oh, yes, the Unicode problem.
It would take an examination of systems in use, but my impression is that most
programs use either UTF-16 or UTF-8 as their internal encoding. With that in
mind, I think it might be best to have the XML library internally support
exactly these two encodings (perhaps as two template specializations) and
interact with the user only in these two encodings. The transcoding of whatever
external character set/encoding is used would then be an issue for the I/O
interface. However, such transcoding requires the I/O interface to be
sufficiently abstract to provide it transparently, which is an obstacle for the
low-level approach you suggest above.
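
As a rough sketch of the "two template specializations" idea: the tag types,
encoding_traits, and to_internal below are hypothetical names, and a
std::u32string of code points stands in for whatever the I/O layer would
deliver after decoding the external encoding.

```cpp
#include <cassert>
#include <string>

struct utf8_tag {};
struct utf16_tag {};

// Primary template is left undefined on purpose: only the two
// specializations below exist, so no other internal encoding compiles.
template <typename Encoding>
struct encoding_traits;

template <>
struct encoding_traits<utf8_tag> {
    using string_type = std::string;

    // Appends one code point as 1 to 4 UTF-8 code units.
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

template <>
struct encoding_traits<utf16_tag> {
    using string_type = std::u16string;

    // Appends one code point as 1 code unit or a surrogate pair.
    static void append(string_type& out, char32_t cp) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
};

// Converts decoded code points into the chosen internal encoding.
template <typename Encoding>
typename encoding_traits<Encoding>::string_type
to_internal(const std::u32string& code_points) {
    typename encoding_traits<Encoding>::string_type out;
    for (char32_t cp : code_points) {
        // Only Unicode scalar values are valid input here.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        encoding_traits<Encoding>::append(out, cp);
    }
    return out;
}
```

The rest of the parser would be written once against encoding_traits, and
the user would pick utf8_tag or utf16_tag at compile time.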

> The ability to parse partial content would be a great plus.

Yes, that seems to be important. However, the user should be able to switch it
off, enabling the parser to work with a single lookahead character. (To support
partial content, the parser needs either to support extremely complex state
saving or to cache content until a complete event has been generated.)
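
The caching variant can be sketched in a few lines. This is a toy under
stated assumptions, not real XML tokenizing: ChunkedTokenizer is a made-up
name, and a "token" is either a complete "<...>" tag or the text run in
front of the next '<'.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical incremental tokenizer that caches input until a complete
// token is available, instead of saving fine-grained parser state.
class ChunkedTokenizer {
public:
    // Accepts the next chunk and returns every token it completes.
    std::vector<std::string> feed(const std::string& chunk) {
        buffer_ += chunk;
        std::vector<std::string> tokens;
        std::size_t pos = 0;
        while (pos < buffer_.size()) {
            std::size_t end;
            if (buffer_[pos] == '<') {
                end = buffer_.find('>', pos);
                if (end == std::string::npos) break;  // tag incomplete, keep it
                ++end;                                // include the '>'
            } else {
                end = buffer_.find('<', pos);
                if (end == std::string::npos) break;  // text may still continue
            }
            assert(end > pos);
            tokens.push_back(buffer_.substr(pos, end - pos));
            pos = end;
        }
        buffer_.erase(0, pos);  // drop what was emitted, keep the tail
        return tokens;
    }

    // Number of cached characters that do not yet form a complete token.
    std::size_t pending() const { return buffer_.size(); }

private:
    std::string buffer_;
};
```

Feeding "<a>he", then "llo</", then "a>" yields "<a>", then "hello", then
"</a>". The cost is the cache itself, which is why it should be optional.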

> Writing a complete XML solution is a lot of work, especially if you want
> to support all XML technologies (XMLSchema, RelaxNG, XPath, XLink,
> XInclude, XPointer...)

It is. However, it is something that, I think, can be done very well in steps,
i.e. the first release supports only pull parsing, the second release adds push
parsing, the third adds a DOM, the fourth another technology, etc.
As long as you consider all possible technologies when implementing the basic
ones, this ought to be feasible.
And yes, it's a lot of work. I'm willing to put a lot of work into it.

> Maybe it could be interesting to reuse libxml2, which is under the MIT
> license, to build something on top of it. Of course first we need to weigh
> the gains of a new C++ implementation.

See my reply to Stefan Seefeld. I think that within Boost, depending on an
external library, no matter what license, is a very bad idea. I also think that
an implementation intended from the ground up to work with C++ is a better
choice.

Thank you for your comments. They've given me some ideas.

