
From: Stefan Seefeld (seefeld_at_[hidden])
Date: 2005-11-08 09:34:46


Anthony Williams wrote:

>>Ah, I think I understand what you mean by 'character type'. Yes, you are right.
>>The code as I posted it to the vault is missing these bits that enable users
>>to write converters without knowing backend-specific details. However, some
>>'dom::char_trait' should be enough, right?
>
>
> Yes and no. Suppose my incoming data is a stream of 8-bit "characters", using
> Shift-JIS encoding. I need to write a converter to convert this to whatever
> encoding is accepted by the XML API. I need to know which encoding to use when
> I write my converter --- if the API is expecting UTF-16 stored in a string of
> unsigned shorts, my converter is going to be quite different from one for an API
> expecting UTF-8 stored in a string of unsigned chars. I also need to know how
> to construct the final string --- whether I need to provide a
> boost::xml::char_type*, or whether I need to construct a
> boost::xml::string_type from a pair of iterators, or something else.

[...]

>>>I was suggesting that we have overloads like:
>>>
>>>node::append_element(utf8_string_type);
>>>node::append_element(utf16_string_type);
>>>node::append_element(utf32_string_type);
>>>
>>>With two of them (but unspecified which two) converting to the correct
>>>internal encoding.
>>
>>Oh, but that multiplies quite a chunk of the API by four!
>
>
> What's the fourth option? Yes, I agree it multiplies the API, but for the
> convenience of users.

Sorry, that's because I can't count.

>>Typically, a unicode library provides converter functions, so what advantage
>>would such a rich interface have over asking the user to do the conversion
>>before calling into the xml library?
>
>
> It avoids the user doing any conversion in many cases.

Well, I would phrase it differently: instead of encapsulating unicode-related
functionality, you are suggesting spreading it across various APIs.

Really, we are now exclusively arguing about unicode-related issues, which
I deliberately designed out of my API. Let me rephrase the relevant part of
my suggestion:

The XML API provides a means to write a converter to internalize (and externalize)
unicode strings, but otherwise remains agnostic to unicode issues. This allows the
library to collaborate with any external unicode library without duplicating
its functionality.
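
To make this concrete, here is roughly the shape I have in mind. This is only
a sketch, not the code in the vault; the names ('char_trait', 'converter') and
the UTF-8 internal representation are illustrative assumptions:

#include <string>

namespace boost { namespace xml { namespace dom {

// Assumption for this sketch: the backend (say, libxml2) stores UTF-8
// octets, and char_trait publishes that internal representation.
struct char_trait
{
  typedef char char_type;          // UTF-8 code units
  typedef std::string string_type; // internal string representation
};

// Users implement this once for their external string type S; the
// library invokes it at every API boundary and is otherwise
// unaware of unicode.
template <typename S>
struct converter
{
  static char_trait::string_type in(S const &external);
  static S out(char_trait::string_type const &internal);
};

}}} // namespace boost::xml::dom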

> If the encoding is only available as a compile-time constant, that won't help
> me write a converter. I need it available as a software-writing-time constant
> for that (i.e. specified in the documentation).
>
> If you don't want to fix the encoding in the docs, maybe we should require
> that the user supply conversions to each of UTF-8, UTF-16 and UTF-32, and the
> library will use whichever is most convenient.

Exactly. Unicode libraries provide these conversion functions, and the user
should be able to implement the boost::xml::dom::converter with these.
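
For example, with ICU (purely as an assumed example; any unicode library with
equivalent conversions would do), the specialization becomes a thin wrapper
around calls ICU already provides, given the sketch above:

#include <unicode/unistr.h> // icu::UnicodeString
#include <string>

namespace boost { namespace xml { namespace dom {

// Specialization for ICU's string type, assuming the UTF-8 internal
// representation from the earlier sketch.
template <>
struct converter<icu::UnicodeString>
{
  static std::string in(icu::UnicodeString const &s)
  {
    std::string utf8;
    s.toUTF8String(utf8);                      // ICU: UTF-16 -> UTF-8
    return utf8;
  }
  static icu::UnicodeString out(std::string const &utf8)
  {
    return icu::UnicodeString::fromUTF8(utf8); // ICU: UTF-8 -> UTF-16
  }
};

}}} // namespace boost::xml::dom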

> In the input layer of an application, you need to deal with all the variety of
> encodings that the user might supply. I'm quite happy to use a single Unicode
> library to deal with the conversions, but I can imagine having to deal with
> numerous external encodings. I would like the rest of the application to have
> no need to know about the complications of the input handling, and the variety
> of encodings used --- provided I get a set of DOM objects from somewhere, the
> rest of the application shouldn't care.

Access to the DOM content is routed through the unicode library by means of
the converter. Thus, whatever requirements you have for dealing with encodings
should all be taken care of there.

> Once the input has been handled, and the DOM built, there might be additional
> input in terms of XPath expressions, or element names, which might be in
> another encoding still. Again, the choice of input encoding here should have
> no impact on the rest of the application.

Correct. Same reasoning as above.

> With the current design, the whole API is tied to a single external string
> type, with a single converter function for converting to the internal string
> type. This implies that if you wish to use different encodings, you need a
> different external string type, and therefore you end up with different
> template instantiations for different encodings, and my nice separate
> application parts suddenly need to know what encodings are used for input and
> output.

Oh, now I see your point! You argue that multiple encodings will be tied to
multiple C++ types, even if they are part of the same unicode library.
I'm not quite sure what to say. I suspect there are ways around this issue
with a clever choice of the string type template argument for the library.
But if not, let's fix that once it becomes a problem.
I'd rather start simple and let the system evolve once we see users plug
real unicode libraries into it.
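
To sketch what I mean: if the application picks a single unicode string type
(say ICU's) as the one external string type, the input layer can funnel every
external encoding through the unicode library, and only one instantiation of
the DOM templates ever exists. The 'document' template below is hypothetical:

#include <unicode/unistr.h>
#include <string>

// Hypothetical: one application-wide instantiation of the DOM
// (assumes the library's headers are included).
typedef boost::xml::dom::document<icu::UnicodeString> document_type;

// The input layer absorbs the variety of external encodings, e.g.
// Shift-JIS, via ICU's codepage-conversion constructor; the rest of
// the application only ever sees icu::UnicodeString.
icu::UnicodeString from_shift_jis(std::string const &raw)
{
  return icu::UnicodeString(raw.data(), static_cast<int32_t>(raw.size()),
                            "Shift-JIS");
}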

> For axemill, I decided to provide a set of conversion templates for converting
> between encodings.

What unicode libraries are you working with? As I said above, I'd expect these
to provide all the conversions, whether or not that generates a new C++
type.

Regards,
                Stefan

