From: Atry (pop.atry_at_[hidden])
Date: 2007-09-29 02:14:58

Next message: Atry: "Re: [boost] Strings tagged with their character set"
Previous message: Simon Atanasyan: "Re: [boost] SUN Compiler option -features=tmplrefstatic"
In reply to: Phil Endecott: "[boost] Strings tagged with their character set"
Next in thread: Atry: "Re: [boost] Strings tagged with their character set"
Reply: Atry: "Re: [boost] Strings tagged with their character set"

I have done some work on this; see
http://lists.boost.org/Archives/boost/2007/08/125945.php

2007/9/24, Phil Endecott <spam_from_boost_dev_at_[hidden]>:
>
> Dear All,
>
> Something that I have been thinking about for a while is storing
> strings tagged with their character set. Since I now have a practical
> need for this I plan to try to implement something. Your feedback
> would be appreciated.
>
> The starting point is the idea that the character set of a string may
> be known at compile time or at run time, and so two types of tagging
> are possible. First compile-time tagging:
>
> template <character_set>
> class tagged_string { ... };
>
> tagged_string<utf8> s1;
> tagged_string<latin1> s2;
>
> Some typedefs would be appropriate:
>
> typedef tagged_string<utf8> utf8string;
>
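
A minimal, standalone sketch of the compile-time tagging described above, assuming the tags are empty structs (the post leaves the nature of 'character_set' open) and that the bytes are held in a std::string member. The bytes() accessor and the explicit constructor are illustrative choices, not part of the proposal. It needs C++11 for static_assert.

```cpp
#include <string>
#include <type_traits>

// Hypothetical tag types standing in for the 'character_set' parameter.
struct utf8 {};
struct latin1 {};

// A string whose character set is part of its static type.
template <typename CharSet>
class tagged_string {
public:
    tagged_string() {}
    // Explicit, so raw bytes are never silently given a character set.
    explicit tagged_string(const std::string& bytes) : bytes_(bytes) {}

    // Access to the undecoded bytes.
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

typedef tagged_string<utf8>   utf8string;
typedef tagged_string<latin1> latin1string;

// Different tags give unrelated types, so mixing them is a compile error.
static_assert(!std::is_same<utf8string, latin1string>::value,
              "differently tagged strings are distinct types");
```

With this shape, assigning a latin1string to a utf8string does not compile unless a conversion is written explicitly.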
>
> Now run-time tagging:
>
> class rt_tagged_string {
> private:
> character_set cs;
> public:
> rt_tagged_string(character_set cs_): cs(cs_) ...
> ...
> };
>
> rt_tagged_string s3(utf8);
>
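
A standalone sketch of the run-time variant, assuming for illustration that 'character_set' is a plain enum. A real design would need something user-extensible, as the post notes further down. The charset() and bytes() accessors are hypothetical names.

```cpp
#include <string>

// Hypothetical run-time identifier for a character set.
enum character_set { utf8, latin1, latin2 };

// A string that carries its character set as a run-time value.
class rt_tagged_string {
public:
    explicit rt_tagged_string(character_set cs) : cs_(cs) {}
    rt_tagged_string(character_set cs, const std::string& bytes)
        : cs_(cs), bytes_(bytes) {}

    character_set charset() const { return cs_; }
    const std::string& bytes() const { return bytes_; }

private:
    character_set cs_;
    std::string bytes_;
};
```

Here all strings share one C++ type, so a character set mismatch can only be detected by comparing charset() values at run time.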
> (Concise-yet-clear names for any of these classes would be great.)
>
>
> I propose to implement conversion between the strings using iconv
> and/or GNU recode. It would be easy to allow this conversion to happen
> invisibly, but it might be wiser to make conversion explicit.
>
> I'm not sure what the 'character_set' that I've used above should be.
> It needs to be some sort of user-extensible enum or type-tag.
>
> We need character types of 8, 16 and 32 bits. wchar_t is not useful here
> because it's not defined whether it's 16 or 32 bits. So I propose the
> following, modelled after cstdint:
>
> typedef char char8_t;
> typedef <implementation-defined> char16_t;
> typedef <implementation-defined> char32_t;
>
>
> I then propose a character_set_traits class:
>
> template <character_set>
> class character_set_traits;
>
> template <>
> class character_set_traits<utf8> {
> public:
> typedef char8_t char_t;
> static const bool variable_width = true;
> ...
> };
>
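
A standalone sketch of the traits idea, assuming empty structs as tags and plain char as the 8-bit character type. The latin1 specialization is added purely for contrast. Leaving the primary template undefined means an unknown character set fails at compile time.

```cpp
#include <type_traits>

// Hypothetical tag types for two character sets.
struct utf8 {};
struct latin1 {};

// Primary template is declared but not defined:
// using an unsupported character set is a compile error.
template <typename CharSet>
struct character_set_traits;

template <>
struct character_set_traits<utf8> {
    typedef char char_t;                       // 8-bit code units
    static const bool variable_width = true;   // 1 to 4 units per character
};

template <>
struct character_set_traits<latin1> {
    typedef char char_t;                       // 8-bit code units
    static const bool variable_width = false;  // always 1 unit per character
};

// The properties are usable in constant expressions.
static_assert(character_set_traits<utf8>::variable_width,
              "UTF-8 is variable width");
static_assert(!character_set_traits<latin1>::variable_width,
              "Latin-1 is fixed width");
```

Generic code can then branch on variable_width at compile time, for example to enable indexing only for fixed-width character sets.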
>
> For the fixed-width, compile-time-tagged strings I think it makes sense
> to inherit from std::basic_string<
> character_set_traits<charset>::char_t >. The only problem I can see
> with this is that
>
> latin1string s1 = "hello world";
> s1.substr(1,5) <--- this returns a std::string, not a latin1string
>
> If latin1string has a constructor from std::string (which is its own
> base type) that's fine, i.e. we can still write:
>
> latin1string s2 = s1.substr(1,5);
>
> but unfortunately we can also write
>
> latin2string s3 = s1.substr(1,5);
>
> which is not so good.
>
> So a different approach is to define a set of character-set-specific
> character types, and build string types from them:
>
> typedef char8_t latin1char;
> typedef char8_t latin2char;
>
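
One caveat on the typedefs above: a typedef introduces an alias, not a new type, so latin1char and latin2char as written would name the same type, and strings built from them would still be interchangeable. The sketch below shows one possible alternative, not taken from the post: keep char as the character type and attach the tag to the traits parameter of std::basic_string. The name tagged_char_traits is hypothetical.

```cpp
#include <string>
#include <type_traits>

// Hypothetical tag types for two character sets.
struct latin1 {};
struct latin2 {};

// Behaves exactly like std::char_traits<char>; the unused template
// parameter only serves to make each instantiation a distinct type.
template <typename CharSet>
struct tagged_char_traits : std::char_traits<char> {};

typedef std::basic_string<char, tagged_char_traits<latin1> > latin1string;
typedef std::basic_string<char, tagged_char_traits<latin2> > latin2string;

// substr() returns the same basic_string specialization, so the tag survives
// and a latin1string cannot be used to initialise a latin2string.
static_assert(!std::is_convertible<latin1string, latin2string>::value,
              "differently tagged strings do not convert implicitly");
```

With these definitions, `latin1string s2 = s1.substr(1,5);` compiles and `latin2string s3 = s1.substr(1,5);` does not. The cost is that such strings are no longer std::string, so they do not work directly with std::ostream or with functions taking std::string.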
>
> For variable-width character sets, the methods of std::string are less
> useful (though far from useless). I understand that there's already a
> utf8 iterator somewhere in Boost, can it help?
>
> For run-time character sets, is there any way to provide e.g. run-time
> iterators?
>
> I imagine these strings being used as follows:
> - Input to the program is either run-time or compile-time tagged with
> any character set.
> - Data that is not manipulated in any way is just passed through.
> - Data that will be processed will first be converted to a suitable,
> compile-time-tagged, character set, and if appropriate converted back
> afterwards.
>
> So the absence of (useful) string operations on run-time-tagged or
> variable-width character set data is not a problem.
>
> For conversions, there is the question of partial characters in
> variable-width character sets. If a program is processing data in
> chunks it may be legitimate for a chunk boundary to fall in the middle
> of a UTF-8 character. IIRC, iconv has a method to deal with this which
> we could expose in a stateful converter:
>
> charset_converter utf8_to_ucs4(utf8,ucs4);
> while (!eof) {
> utf8string s = get_chunk();
> ucs4string t = utf8_to_ucs4(s);
> send_chunk(t);
> }
> utf8_to_ucs4.flush();
>
> - but many applications may only need a stateless converter.
>
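
A standalone sketch of what a stateful converter could look like for the UTF-8 to UCS-4 case, written by hand rather than on top of any conversion library. The class name utf8_to_ucs4_converter and the bool returned by flush() are assumptions for illustration. It uses the C++11 types char32_t and std::u32string, and for brevity it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 supplied in arbitrary chunks. A character that is split
// across a chunk boundary is completed when the next chunk arrives.
class utf8_to_ucs4_converter {
public:
    utf8_to_ucs4_converter() : pending_(0), remaining_(0) {}

    std::u32string operator()(const std::string& chunk) {
        std::u32string out;
        for (std::size_t i = 0; i < chunk.size(); ++i) {
            unsigned char b = static_cast<unsigned char>(chunk[i]);
            if (remaining_ > 0) {
                if ((b & 0xC0) == 0x80) {          // continuation byte
                    pending_ = (pending_ << 6) | (b & 0x3F);
                    if (--remaining_ == 0) out.push_back(pending_);
                    continue;
                }
                // Sequence was cut short: emit U+FFFD, then treat b as a
                // fresh lead byte below.
                out.push_back(0xFFFD);
                remaining_ = 0;
            }
            if (b < 0x80) {
                out.push_back(b);                  // single-byte character
            } else if ((b & 0xE0) == 0xC0) {
                pending_ = b & 0x1F; remaining_ = 1;
            } else if ((b & 0xF0) == 0xE0) {
                pending_ = b & 0x0F; remaining_ = 2;
            } else if ((b & 0xF8) == 0xF0) {
                pending_ = b & 0x07; remaining_ = 3;
            } else {
                out.push_back(0xFFFD);             // stray or invalid byte
            }
        }
        return out;
    }

    // Call at end of input. Returns true if no partial character was left
    // over, and resets the converter either way.
    bool flush() {
        bool clean = (remaining_ == 0);
        remaining_ = 0;
        return clean;
    }

private:
    char32_t pending_;   // bits of the character decoded so far
    int remaining_;      // continuation bytes still expected
};
```

For example, feeding the chunks "a\xC3" and then "\xA9" yields U+0061 from the first call and U+00E9 from the second. A stateless converter would have to report an error on the first chunk or silently drop the trailing byte.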
>
> I will be working on this over the next couple of weeks, so any
> feedback would be much appreciated.
>
> Regards,
>
> Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk