
From: Anthony Williams (anthony_w.geo_at_[hidden])
Date: 2005-11-07 12:50:58


Stefan Seefeld <seefeld_at_[hidden]> writes:

> Anthony Williams wrote:
>
>>>>In order to use a particular external type, such as std::string, the user has
>>>>to supply a specialization of converter<> for their type, which converts to
>>>>and from the libxml xmlChar type.
>>>
>>>Correct. That's the price to pay for not forcing any particular unicode library
>>>on users who want to use the XML API.
>>
>>
>> Hmm. What is an xmlChar? From your string.hpp, it appears it is the same as a
>> normal char, since you can cast a char* to an xmlChar*, but I don't know
>> libxml2, so I wouldn't like to assume.
>
> You must not care ! :-)
> Seriously, though, xmlChar is indeed an alias for 'char', meaning the strings
> in libxml2 can be passed around as usual (in particular, are null-terminated).
> The format, though, is UTF-8, so a simple cast to 'char' only makes sense if
> the document contains ASCII only.
>
>> I would rather that the boost::xml API defined a type (even if it was a
>> typedef for the libxml xmlChar), and the requirements on that type
>> (e.g. ASCII, UTF-32 or UTF-8 encoding).
>>
>> By exposing the underlying character type of the backend like this, you are
>> restricting the backends to those that share the same internal character type
>> and encoding, or imposing an additional conversion layer on backends with a
>> different internal encoding.
>
> Why ? I propose a mechanism involving at most a single conversion. Why the
> additional layer ?

Assume I know the encoding and character type I wish to use as input. In order
to specialize converter<> for my string type, I need to know what encoding and
character type the library is using. If the encoding and character type are
not specified in the API, but are instead open to the whims of the backend, I
cannot write my conversion code.
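
To make that concrete, here is roughly the sort of specialization a user
would have to write. The member names (in/out) and the exact interface are
only guesses on my part, but the point stands regardless: the specialization
cannot be written without committing to an encoding on both sides.

#include <string>

typedef char xmlChar; // per the discussion above; real code uses libxml2's own typedef

namespace boost { namespace xml {

template <typename S> struct converter; // primary template, specialized by users

// Hypothetical specialization for std::string. It is only correct if we
// *assume* the xmlChar data is UTF-8 and the std::string data matches it.
template <>
struct converter<std::string>
{
    // From the backend: take the bytes as-is, assuming they are UTF-8.
    static std::string in(xmlChar const* s) { return std::string(s); }

    // To the backend: hand the same bytes back, again assuming UTF-8.
    static std::string out(std::string const& s) { return s; }
};

}} // namespace boost::xml

If the backend were free to store some other encoding internally, there
would be no correct way to fill in either function.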

For example, your string.hpp converter only works if the native encoding for
char on the current platform is UTF-8, or the application only uses a subset
common to both encodings (e.g. ASCII). On platforms where EBCDIC is the
default, this won't work.

>> Just as an example, my axemill parser uses a POD struct containing an unsigned
>> long as the character type, so that each Unicode Codepoint is a single
>> "character", and I don't have to worry about variable-length encodings such as
>> UTF-8 internally.
>
> (that may consume considerably more memory for big documents, and a lot of waste
> if the content is ASCII)

Indeed it might, but that's a decision I'm happy with for now --- it doesn't
currently store entire documents in memory, as it's a SAX-style push parser
(though the API is completely non-SAX). I'm just using it here as an example
of a backend that doesn't use UTF-8.

>> If I wanted use axemill as the backend parser, and handle
>> std::wstring input on a platform where wchar_t was UTF-32, but keep xmlChar in
>
> (some nit-picking: wchar_t and UTF-32 are unrelated concepts. The former provides
> a storage type of some (unfortunately platform-dependent) size, while the latter
> defines an encoding. See the various unicode-related threads in this ML.)

I know about the issues surrounding wchar_t and encodings. They are not
entirely unrelated.

The platform has to pick a default encoding for wchar_t, so we know whether or
not 0x61==L'a', for example. On platforms where wchar_t is 32 bit, this can be
(and often is) UTF-32.
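
As a minimal illustration (nothing assumed here beyond a hosted compiler):

#include <iostream>

int main()
{
    // How wide wchar_t is, and whether its literals line up with Unicode
    // code points, are both properties of the platform, not of wchar_t.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    std::cout << std::boolalpha << (L'a' == 0x61) << '\n';
}

On a platform whose wide execution character set is UTF-32 (ISO-10646
UCS-4) the second line prints true; nothing in the language requires it to.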

>> the API, the converter would have to change UTF-32 to UTF-8 (I assume), and
>> then internally this would have to be converted back to UTF-32.
>
> Well, we definitely need some 'xml char trait' for the backend to fill in that
> provides sufficient information for users to write their own converter.
> Again, the hope is to do that such that any redundant conversion / copying can
> be avoided.

Good.
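
Something along the following lines would do. None of these names exist in
the proposal yet; this is just a sketch of the kind of information a backend
could publish so that converter<> specializations can be written without
guessing:

namespace boost { namespace xml {

enum encoding { enc_ascii, enc_utf8, enc_utf16, enc_utf32 };

// Primary template: each backend specializes this.
template <typename Backend> struct char_trait;

struct libxml2_backend; // tag type, purely for illustration

template <>
struct char_trait<libxml2_backend>
{
    typedef char char_type;                             // xmlChar, in this case
    static const encoding internal_encoding = enc_utf8; // what the backend stores
};

}} // namespace boost::xml

A converter author can then decide at compile time whether any conversion is
needed at all.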

>> I would suggest that the API accepts input in UTF-8, UTF-16 and UTF-32. The
>> user then has to supply a conversion function from their encoding to one of
>> these, and the library converts internally if the one they choose is not the
>> "correct" one.
>
> It already does. libxml2 provides conversion functions. I need to hook them
> up into such an 'xml char trait'.

I don't understand how your response ties in with my comment, so I'll try
again.

I was suggesting that we have overloads like:

node::append_element(utf8_string_type);
node::append_element(utf16_string_type);
node::append_element(utf32_string_type);

With two of them (but unspecified which two) converting to the correct
internal encoding.

If the user has EBCDIC or shift-JIS data, then they need to convert to one of
these three standard types. If their data is already correctly encoded, but
not using the same string type, then they just need to recast to the
appropriate string type.
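
In other words, something like this (the concrete string typedefs are
placeholders; any spelling with the right code-unit width would do):

#include <string>
#include <vector>

typedef std::string                 utf8_string_type;  // UTF-8 code units
typedef std::vector<unsigned short> utf16_string_type; // UTF-16 code units
typedef std::vector<unsigned long>  utf32_string_type; // UTF-32 code points

class node
{
public:
    // One of these matches the internal encoding and costs nothing; the
    // other two perform exactly one conversion at the boundary. Which is
    // which is an implementation detail of the backend.
    void append_element(utf8_string_type const& name);
    void append_element(utf16_string_type const& name);
    void append_element(utf32_string_type const& name);
};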

>>>As my node implementations already know their type (in terms of an enum tag),
>>>casting is a simple matter of rewrapping the implementation by a new proxy.
>>
>>
>> There's nothing stopping this from continuing to work.
>
> Right, though it becomes a bit more involved. node_ptr doesn't hold a 'node *'
> despite its name, but rather a 'node' (which itself, being a proxy, points to
> a xmlNode). thus, casting node_ptr to element_ptr (or the other way around) will
> actually construct a new element (wrapper). The only way to make this work
> with polymorphic nodes is to heap-allocate nodes (inside node_ptr), which
> requires extra calls to new / delete. We may override these operators for
> node types, though, but it's not *that* trivial to optimize.

Agreed, it will require careful thought.

>>>In your case I couldn't encapsulate the binding in a single place, as you mention
>>>yourself.
>>
>>
>> Agreed, but you wouldn't have to. It's also more flexible --- it would allow
>> input to come in with one encoding/string type, and output to be generated
>> with a different encoding/string type, but the same boost::xml::dom objects
>> could be used.
>
> How big of an issue is that, really ? How many users use different unicode libraries
> in the same application ? If they really want different encodings, they may as well
> do the conversion within their unicode library.

Imagine, for example a web browser or XML editor. The XML comes in as a byte
stream with an encoding tag such as a Charset-encoding field (if you're
lucky). You then have to read this and convert it from whatever encoding is
specified to the DOM library's internal encoding, do some processing and then
output to the screen in the user's chosen encoding.

If I specify the conversions to use directly on the input and output, then I
can cleanly separate my application into three layers --- process input, and
build DOM in internal encoding; process DOM as necessary; display result to
user.

If the string type and encoding are inherently part of the DOM types, this is
not so simple.
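
As a sketch (every name below is a placeholder; only the layering matters):

#include <string>

typedef std::string byte_string;       // raw bytes off the wire or to the screen
typedef std::string utf8_string_type;  // assume UTF-8 as the internal encoding here

// Layer 1 (boundary): decode using whatever charset the input declared.
utf8_string_type decode_input(byte_string const& raw, std::string const& charset);

// Layer 2 (core): builds and processes the DOM purely in the internal
// encoding; it never sees the wire or display encodings.
utf8_string_type process(utf8_string_type const& text);

// Layer 3 (boundary): re-encode for the user's chosen display encoding.
byte_string encode_for_display(utf8_string_type const& text, std::string const& charset);

byte_string handle(byte_string const& raw,
                   std::string const& input_charset,
                   std::string const& output_charset)
{
    return encode_for_display(process(decode_input(raw, input_charset)),
                              output_charset);
}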

>>>What would be possible, though, is to put all types into a single parametrized
>>>struct:
>>>
>>>template <typename S>
>>>struct types
>>>{
>>> typedef document<S> document_type;
>>> typedef node_ptr<element<S> > element_ptr;
>>> ...
>>>};
>>
>>
>> This is preferable to the current proposed API, but I still prefer that the
>> conversion happens at the boundary as per my suggestions, rather than the
>> entire classes being parameterized.
>
> I'm not sure I understand your requirement ? Do you really want to plug in
> multiple unicode libraries / string types ? Or do you want to use multiple
> encodings ?

Multiple encodings, generally. However, your converter<> template doesn't
allow for that --- it only allows one encoding per string type.

Anthony

-- 
Anthony Williams
Software Developer
Just Software Solutions Ltd
http://www.justsoftwaresolutions.co.uk
