From: Stefan Seefeld (seefeld_at_[hidden])
Date: 2005-11-07 11:59:08


Anthony Williams wrote:

>>>In order to use a particular external type, such as std::string, the user has
>>>to supply a specialization of converter<> for their type, which converts to
>>>and from the libxml xmlChar type.
>>
>>Correct. That's the price to pay for not forcing any particular unicode library
>>on users who want to use the XML API.
>
>
> Hmm. What is an xmlChar? From your string.hpp, it appears it is the same as a
> normal char, since you can cast a char* to an xmlChar*, but I don't know
> libxml2, so I wouldn't like to assume.

You shouldn't have to care! :-)
Seriously, though, xmlChar is a typedef for 'unsigned char', so it is indeed
char-sized, meaning the strings in libxml2 can be passed around as usual (in
particular, they are null-terminated). The encoding, though, is UTF-8, so a
simple cast to 'char' only makes sense if the document contains ASCII only.
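
To make the ASCII caveat concrete, here is a minimal sketch. It assumes an
unsigned-char-based character type; 'xml_char', 'utf8_length' and 'is_ascii'
are hypothetical names for this illustration, not part of libxml2 or of the
proposed API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the backend's character type.
typedef unsigned char xml_char;

// Count Unicode code points in a null-terminated UTF-8 buffer by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed input.
std::size_t utf8_length(const xml_char* s)
{
    std::size_t n = 0;
    for (; *s; ++s)
        if ((*s & 0xC0) != 0x80) ++n;
    return n;
}

// True if every byte is below 0x80, i.e. the text is plain ASCII and
// reinterpreting it as 'char' loses nothing.
bool is_ascii(const xml_char* s)
{
    for (; *s; ++s)
        if (*s >= 0x80) return false;
    return true;
}
```

For pure ASCII content the byte count and the code point count agree; as soon
as a multi-byte sequence appears they diverge, which is where the naive cast
stops being meaningful.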

> I would rather that the boost::xml API defined a type (even if it was a
> typedef for the libxml xmlChar), and the requirements on that type
> (e.g. ASCII, UTF-32 or UTF-8 encoding).
>
> By exposing the underlying character type of the backend like this, you are
> restricting the backends to those that share the same internal character type
> and encoding, or imposing an additional conversion layer on backends with a
> different internal encoding.

Why? I propose a mechanism involving at most a single conversion. Where would
the additional layer come from?

> Just as an example, my axemill parser uses a POD struct containing an unsigned
> long as the character type, so that each Unicode Codepoint is a single
> "character", and I don't have to worry about variable-length encodings such as
> UTF-8 internally.

(That may consume considerably more memory for big documents, much of it wasted
if the content is ASCII.)
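
As a back-of-the-envelope illustration of that memory argument, the sketch
below compares the storage needed for the same code points in UTF-8 and in a
fixed four-byte representation. The function names are hypothetical, and four
bytes per code point is the UTF-32 minimum; a struct wrapping an unsigned long
may well be larger on some platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed to encode one code point in UTF-8 (assumes a valid code point).
std::size_t utf8_size(unsigned long cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Total UTF-8 storage for a sequence of code points.
std::size_t utf8_bytes(const std::vector<unsigned long>& text)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
        total += utf8_size(text[i]);
    return total;
}

// Total storage with a fixed four bytes per code point.
std::size_t utf32_bytes(const std::vector<unsigned long>& text)
{
    return text.size() * 4;
}
```

For ASCII-only content the fixed-width form takes four times the space; the
gap narrows as the share of non-ASCII characters grows.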

> If I wanted use axemill as the backend parser, and handle
> std::wstring input on a platform where wchar_t was UTF-32, but keep xmlChar in

(some nit-picking: wchar_t and UTF-32 are unrelated concepts. The former provides
a storage type of some (unfortunately platform-dependent) size, while the latter
defines an encoding. See the various unicode-related threads in this ML.)

> the API, the converter would have to change UTF-32 to UTF-8 (I assume), and
> then internally this would have to be converted back to UTF-32.

Well, we definitely need some 'xml char trait' for the backend to fill in that
provides sufficient information for users to write their own converter.
Again, the hope is to do that in a way that avoids any redundant conversion or
copying.
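
One possible shape for such a trait plus a user-supplied converter, purely as
a sketch: 'backend_char_traits', 'encoding', 'xml_char' and the 'in' / 'out'
member names are assumptions made up for this example, and the std::string
specialization assumes the string already holds UTF-8.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the backend's character type.
typedef unsigned char xml_char;

enum encoding { utf8, utf16, utf32 };

// Hypothetical trait the backend fills in so that users know what to target.
struct backend_char_traits
{
    typedef xml_char char_type;
    static encoding internal_encoding() { return utf8; }
};

// Primary template left undefined: users specialize it for their string type.
template <typename S> struct converter;

// Example specialization, assuming std::string already contains UTF-8.
template <> struct converter<std::string>
{
    // External -> internal: copy the bytes and append a terminating null.
    static std::vector<xml_char> in(const std::string& s)
    {
        std::vector<xml_char> buf(s.begin(), s.end());
        buf.push_back(0);
        return buf;
    }

    // Internal -> external: reinterpret the null-terminated bytes.
    static std::string out(const xml_char* s)
    {
        return std::string(reinterpret_cast<const char*>(s));
    }
};
```

A user of a UTF-16 or UTF-32 string type would consult the trait and transcode
in 'in' / 'out' instead of copying bytes.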

> I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The
> user then has to supply a conversion function from their encoding to one of
> these, and the library converts internally if the one they choose is not the
> "correct" one.

It already does. libxml2 provides conversion functions. I need to hook them
up into such an 'xml char trait'.

> I would imagine that a user that works in UTF-8 will choose to
> provide a UTF-8 conversion, someone that works with UCS-2 wchar_t characters
> will provide a UTF-16 conversion, and someone that works with UTF-32 wchar_t
> characters will provide a UTF-32 conversion. Someone who uses a different
> encoding, such as EBCDIC, will provide a conversion appropriate to their
> usage. This should produce the minimum of cross-encoding conversions.

Yup.

[...]

> Two virtual method calls (node.visit/visitor.handle) vs two plain function
> calls (node.type/handle_xxx), a switch, and a "cast" that constructs a new
> object.
>
> Have you run a profiler on it?
>
> Premature optimization is the root of all evil; I would rather have something
> that helps me write correct code, rather than fast code. I really dislike
> switch-on-type code, and I'm not convinced that it is necessarily faster in
> all cases.

Ok, fair enough. That can easily be tested, and the change is (almost)
straightforward.
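
For reference, the two dispatch styles under discussion look roughly like
this. The visit / handle names follow the thread; 'name_visitor',
'tagged_node' and 'describe' are hypothetical, and this is a reduced sketch,
not the code from the proposal.

```cpp
#include <cassert>
#include <string>

struct element;
struct text;

// Visitor style: two virtual calls (node.visit, then visitor.handle).
struct visitor
{
    virtual ~visitor() {}
    virtual void handle(element&) = 0;
    virtual void handle(text&) = 0;
};

struct node
{
    virtual ~node() {}
    virtual void visit(visitor& v) = 0;
};

struct element : node { void visit(visitor& v) { v.handle(*this); } };
struct text : node { void visit(visitor& v) { v.handle(*this); } };

// A visitor that records which kind of node it was handed.
struct name_visitor : visitor
{
    std::string last;
    void handle(element&) { last = "element"; }
    void handle(text&) { last = "text"; }
};

// Switch style: a plain type query followed by a switch on the tag.
enum node_type { element_node, text_node };

struct tagged_node { node_type type; };

std::string describe(const tagged_node& n)
{
    switch (n.type)
    {
    case element_node: return "element";
    case text_node: return "text";
    }
    return "unknown";
}
```

With the visitor, forgetting to handle a node type is a compile-time error
(an unimplemented pure virtual); with the switch it silently falls through to
the default.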

>>As my node implementations already know their type (in terms of an enum tag),
>>casting is a simple matter of rewrapping the implementation by a new proxy.
>
>
> There's nothing stopping this from continuing to work.

Right, though it becomes a bit more involved. node_ptr doesn't hold a 'node *'
despite its name, but rather a 'node' (which itself, being a proxy, points to
an xmlNode). Thus, casting node_ptr to element_ptr (or the other way around)
will actually construct a new element (wrapper). The only way to make this work
with polymorphic nodes is to heap-allocate nodes (inside node_ptr), which
requires extra calls to new / delete. We could override these operators for
node types, but that's not *that* trivial to optimize.
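
A stripped-down sketch of the rewrapping scheme described above, to show why
no heap allocation is involved today. 'node_impl' stands in for xmlNode, and
'element_cast' is a hypothetical name; the real classes have more to them.

```cpp
#include <cassert>

enum node_type { element_node, text_node };

// Stand-in for the backend node (xmlNode in the libxml2 case).
struct node_impl { node_type type; };

// Proxy: holds only a pointer to the backend node.
class node
{
public:
    explicit node(node_impl* impl) : impl_(impl) {}
    node_type type() const { return impl_->type; }
    node_impl* impl() const { return impl_; }
private:
    node_impl* impl_;
};

class element : public node
{
public:
    explicit element(node_impl* impl) : node(impl) {}
};

// Holds the proxy by value, not by pointer, despite the name.
template <typename N>
class node_ptr
{
public:
    node_ptr() : node_(nullptr) {}
    explicit node_ptr(const N& n) : node_(n) {}
    const N* operator->() const { return &node_; }
    // Null if the wrapped proxy has no backend node behind it.
    const N* get() const { return node_.impl() ? &node_ : nullptr; }
private:
    N node_;
};

// "Casting" checks the enum tag and constructs a fresh wrapper around the
// same backend node; an empty node_ptr signals a failed cast.
node_ptr<element> element_cast(const node_ptr<node>& p)
{
    if (p.get() && p->type() == element_node)
        return node_ptr<element>(element(p->impl()));
    return node_ptr<element>();
}
```

Making 'node' polymorphic would mean node_ptr can no longer store it by value
without slicing, which is where the heap allocation comes in.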

>>Using RTTI to represent the node's type is definitely possible. I'm just not
>>convinced of its advantages.
>
>
> I'm not convinced of the advantage of not using it ;-)
>
>
>>>One additional comment on re-reading the samples --- having to instantiate
>>>every template for the external string type seems rather awkward.
>>>
>>>One alternative is to accept and return an internal string type, and provide
>>>conversion functions to/from the user's external string type. This way, the
>>>library is not dependent on the string type, but it does add complexity to the
>>>interface.
>>
>>Right, I considered that. One has to be careful with those string conversions,
>>though, to avoid unnecessary copies.
>
>
> Yes. However, this problem exists with your proposed API anyway --- by
> converting on every call to the API, you are forcing possibly-unnecessary
> conversions on your users. For example, they may want to add the same
> attribute to 50 nodes; your proposed API requires that the attribute name is
> converted 50 times. Accepting an internal string type, and making the user do
> the conversion allows the user to do the conversion once, and then pass the
> converted string 50 times.

Good point! I have to think about how to 'internalize' and reuse a string.
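
Something along these lines might work, as a first sketch only:
'interned_string', its 'conversions' counter and this toy 'element' are all
hypothetical, and the "conversion" here is a plain byte copy that assumes
UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the backend's character type.
typedef unsigned char xml_char;

// Converts once on construction; afterwards it can be passed around freely.
class interned_string
{
public:
    explicit interned_string(const std::string& utf8)
        : buf_(utf8.begin(), utf8.end())
    {
        buf_.push_back(0);   // keep the buffer null-terminated
        ++conversions;       // instrumentation for this example only
    }
    const xml_char* c_str() const { return &buf_[0]; }

    static std::size_t conversions;
private:
    std::vector<xml_char> buf_;
};

std::size_t interned_string::conversions = 0;

// Toy element that only counts the attributes it was given.
struct element
{
    std::size_t attribute_count;
    element() : attribute_count(0) {}

    // Takes already-converted strings: no conversion per call.
    void set_attribute(const interned_string& name, const interned_string& value)
    {
        (void)name.c_str();
        (void)value.c_str();
        ++attribute_count;
    }

    // Convenience overload: converts both arguments on every call.
    void set_attribute(const std::string& name, const std::string& value)
    {
        set_attribute(interned_string(name), interned_string(value));
    }
};
```

Setting the same attribute on 50 elements then costs two conversions with the
first overload and a hundred with the second.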

[...]

>>In your case I couldn't encapsulate the binding in a single place, as you mention
>>yourself.
>
>
> Agreed, but you wouldn't have to. It's also more flexible --- it would allow
> input to come in with one encoding/string type, and output to be generated
> with a different encoding/string type, but the same boost::xml::dom objects
> could be used.

How big an issue is that, really? How many users use different unicode libraries
in the same application? If they really want different encodings, they may as well
do the conversion within their unicode library.

>
>
>>What would be possible, though, is to put all types into a single parametrized
>>struct:
>>
>>template <typename S>
>>struct types
>>{
>> typedef document<S> document_type;
>> typedef node_ptr<element<S> > element_ptr;
>> ...
>>};
>
>
> This is preferable to the current proposed API, but I still prefer that the
> conversion happens at the boundary as per my suggestions, rather than the
> entire classes being parameterized.

I'm not sure I understand your requirement. Do you really want to plug in
multiple unicode libraries / string types? Or do you want to use multiple
encodings?

Regards,
                Stefan

