
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-10-17 16:23:02


Hi James, thanks for replying.

James Porter wrote:
> I've been thinking about this off and on as well, though have been a
> little too busy to give it the write-up it deserves. That said, I think
> your code is a pretty good start. While I agree that tagged strings
> shouldn't automatically convert on assignment, I think recode() isn't
> the most useful way to go about it.
>
> In practice, I expect that most code conversion would occur during I/O,
> so I'd prefer to see the conversion done by the stream itself. recode()
> could still exist as a convenience function, though.

Yes, other people have suggested similar things. Even if it were true
that most charset conversion occurred during I/O - and that has not been
my experience in my own work - I would still argue that charset
conversion should be available for use in other contexts.

I see my recode() member function (i.e. utf8_string s2 =
s1.recode<utf8>()) as ultimately being a convenience wrapper around some
sort of free function or functor. The need to track shift states and
partial characters makes this a bit complex, though.

> On the subject of converting between different encodings of strings, I
> noticed that you had some concerns about assignment between two
> different encodings using the same underlying type (latin1_string s =
> utf8_string("foo") for example). This could be resolved by using a
> nominally different char_traits class when inheriting from basic_string.

Yes; it has been suggested that they could differ in their state_type.
I plan to investigate this, but if someone more knowledgeable would
like to do so, please go ahead.

> However, this would cause problems with I/O streams, since they expect a
> particular character type and char_traits. This goes back to my point
> above: the I/O streams should be aware of string tagging (if not
> directly responsible for it).

I imagine that an I/O streams library or some sort of adapter layer
compatible with these strings would be necessary.

> I'll need to think about how to specify character sets so that they're
> usable at compile time and run time, though my instinct would be to use
> subclasses that can be stored in a map of some sort. The subclassing
> would handle compile-time tagging, and the map would handle run-time
> tagging:
>
> class utf8 : public charset_base { ... };
> charset_map["utf8"] = new utf8();
>
> ...
>
> tagged_string<utf8> foo;
> rt_tagged_string bar;
> bar.set_encoding("utf8");
>
> This should combine the benefits of your first and third choices (type
> tags and objects), though I haven't thought about this enough to be
> confident that it's the right way to go.

Yes, this has some advantages. But using a map has the disadvantage
that lookups are more expensive than with the enum-indexed array that
I have; in my code, getting the char* name of a charset is a
compile-time-constant operation. I'm not sure how much that matters in practice.

Thanks for your feedback. Does anyone else have any comments? Do
please have a look at my example code
(http://svn.chezphil.org/libpbe/trunk/examples/charsets.cc) and tell me
how well it would fit in with your approaches to charset conversion.

Regards,

Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk