Boost :
From: Atry (pop.atry_at_[hidden])
Date: 2007-09-29 02:21:23
I think such a library should be based on a range algorithm library like Oven
(http://p-stade.sourceforge.net/oven/doc/html/index.html).
2007/9/29, Atry <pop.atry_at_[hidden]>:
>
> I have some work for it, see
> http://lists.boost.org/Archives/boost/2007/08/125945.php
>
> 2007/9/24, Phil Endecott < spam_from_boost_dev_at_[hidden]>:
> >
> > Dear All,
> >
> > Something that I have been thinking about for a while is storing
> > strings tagged with their character set. Since I now have a practical
> > need for this I plan to try to implement something. Your feedback
> > would be appreciated.
> >
> > The starting point is the idea that the character set of a string may
> > be known at compile time or at run time, and so two types of tagging
> > are possible. First compile-time tagging:
> >
> > template <character_set>
> > class tagged_string { ... };
> >
> > tagged_string<utf8> s1;
> > tagged_string<latin1> s2;
> >
> > Some typedefs would be appropriate:
> >
> > typedef tagged_string<utf8> utf8string;
> >
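A minimal sketch of the compile-time tagging described above, assuming C++11 or later. The empty tag structs `utf8` and `latin1`, the `bytes()` accessor and the explicit constructor are illustrative assumptions, not part of the proposal; the post leaves the form of `character_set` open.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tag types standing in for the unspecified 'character_set'.
struct utf8 {};
struct latin1 {};

// The character set is part of the type, so strings in different
// character sets are distinct, unrelated types.
template <typename CharacterSet>
class tagged_string {
public:
    tagged_string() = default;
    explicit tagged_string(std::string bytes) : bytes_(std::move(bytes)) {}

    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;  // raw storage; interpretation depends on the tag
};

typedef tagged_string<utf8> utf8string;
typedef tagged_string<latin1> latin1string;

// Mixing character sets by accident is rejected at compile time.
static_assert(!std::is_same<utf8string, latin1string>::value,
              "tags produce distinct types");
static_assert(!std::is_convertible<latin1string, utf8string>::value,
              "no implicit conversion between character sets");

// Usage: the same text occupies a different number of bytes per encoding.
inline void tagged_string_usage() {
    utf8string s1("caf\xC3\xA9");  // 5 bytes, 4 characters
    latin1string s2("caf\xE9");    // 4 bytes, 4 characters
    assert(s1.bytes().size() == 5);
    assert(s2.bytes().size() == 4);
}
```

Using empty structs as tags keeps the set of character sets open: a user can add one by declaring a new struct.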
> >
> > Now run-time tagging:
> >
> > class rt_tagged_string {
> > private:
> >   character_set cs;
> > public:
> >   rt_tagged_string(character_set cs_): cs(cs_) ...
> >   ...
> > };
> >
> > rt_tagged_string s3(utf8);
> >
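A minimal sketch of the run-time variant, assuming C++11 or later. The closed `enum class` is a simplifying assumption (the post asks for something user-extensible), and the `charset()` and `bytes()` accessors are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical closed enum; a real design would need to be extensible.
enum class character_set { utf8, latin1, latin2 };

class rt_tagged_string {
public:
    explicit rt_tagged_string(character_set cs) : cs_(cs) {}
    rt_tagged_string(character_set cs, std::string bytes)
        : cs_(cs), bytes_(std::move(bytes)) {}

    character_set charset() const { return cs_; }
    const std::string& bytes() const { return bytes_; }

private:
    character_set cs_;   // known only at run time, e.g. read from a header
    std::string bytes_;  // raw, uninterpreted bytes
};

// Usage: the tag travels with the data and is inspected at run time.
inline void rt_tagged_string_usage() {
    rt_tagged_string s3(character_set::utf8);
    assert(s3.charset() == character_set::utf8);
    assert(s3.bytes().empty());
}
```

Unlike the compile-time version, nothing stops two such strings with different tags from being mixed; any check has to happen at run time.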
> > (Concise-yet-clear names for any of these classes would be great.)
> >
> >
> > I propose to implement conversion between the strings using iconv
> > and/or GNU recode. It would be easy to allow this conversion to happen
> > invisibly, but it might be wiser to make conversion explicit.
> >
> > I'm not sure what the 'character_set' that I've used above should be.
> > It needs to be some sort of user-extensible enum or type-tag.
> >
> > We need character types of 8, 16 and 32 bits. wchar_t is not useful here
> > because it's not defined whether it's 16 or 32 bits. So I propose the
> > following, modelled after cstdint:
> >
> > typedef char char8_t;
> > typedef <implementation-defined> char16_t;
> > typedef <implementation-defined> char32_t;
> >
> >
> > I then propose a character_set_traits class:
> >
> > template <character_set>
> > class character_set_traits;
> >
> > template <>
> > class character_set_traits<utf8> {
> > public:
> >   typedef char8_t char_t;
> >   static const bool variable_width = true;
> >   ...
> > };
> >
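The traits idea can be sketched as follows, assuming C++11 or later. The tag structs, the `ucs4` specialisation and the `string_for` helper are hypothetical additions. Plain `char` is used for the 8-bit unit because C++20 later made `char8_t` a distinct built-in type, and `char32_t` has been built in since C++11.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical tag types for three character sets.
struct utf8 {};
struct latin1 {};
struct ucs4 {};

// Primary template left undefined: an unknown character set is a
// compile-time error rather than a silent default.
template <typename CharacterSet>
struct character_set_traits;

template <>
struct character_set_traits<utf8> {
    typedef char char_t;                      // 8-bit code unit
    static const bool variable_width = true;  // 1 to 4 units per character
};

template <>
struct character_set_traits<latin1> {
    typedef char char_t;
    static const bool variable_width = false;
};

template <>
struct character_set_traits<ucs4> {
    typedef char32_t char_t;                  // 32-bit code unit
    static const bool variable_width = false;
};

// Hypothetical helper: picks the std::basic_string matching a character set.
template <typename CharacterSet>
struct string_for {
    typedef std::basic_string<
        typename character_set_traits<CharacterSet>::char_t> type;
};

static_assert(character_set_traits<utf8>::variable_width,
              "UTF-8 is variable width");
static_assert(std::is_same<string_for<ucs4>::type, std::u32string>::value,
              "UCS-4 strings are stored as 32-bit units");

// Usage: generic code selects storage from the traits.
inline void traits_usage() {
    string_for<ucs4>::type s(3, U'x');
    assert(s.size() == 3);
}
```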
> >
> > For the fixed-width, compile-time-tagged strings I think it makes sense
> > to inherit from std::basic_string<
> > character_set_traits<charset>::char_t >. The only problem I can see
> > with this is that
> >
> > latin1string s1 = "hello world";
> > s1.substr(1,5) <--- this returns a std::string, not a latin1string
> >
> > If latin1string has a constructor from std::string (which is its own
> > base type) that's fine, i.e. we can still write:
> >
> > latin1string s2 = s1.substr(1,5);
> >
> > but unfortunately we can also write
> >
> > latin2string s3 = s1.substr(1,5);
> >
> > which is not so good.
> >
> > So a different approach is to define a set of character-set-specific
> > character types, and build string types from them:
> >
> > typedef char8_t latin1char;
> > typedef char8_t latin2char;
> >
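One caveat about the typedefs above: a typedef is only an alias, so two typedefs of the same underlying type name the same type, and strings built from them would not be distinct. The sketch below shows a different way around the substr problem, assuming C++11 or later: hold the std::string as a member instead of inheriting from it, and make the constructor explicit. The class layout and the `str()` accessor are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

struct latin1 {};
struct latin2 {};

template <typename CharacterSet>
class tagged_string {
public:
    tagged_string() = default;
    // Explicit: a plain std::string never silently acquires a tag.
    explicit tagged_string(std::string s) : s_(std::move(s)) {}

    // Returns the same tagged type, so the tag survives the operation.
    tagged_string substr(std::size_t pos,
                         std::size_t len = std::string::npos) const {
        return tagged_string(s_.substr(pos, len));
    }

    const std::string& str() const { return s_; }

private:
    std::string s_;  // held by composition rather than inheritance
};

typedef tagged_string<latin1> latin1string;
typedef tagged_string<latin2> latin2string;

static_assert(std::is_same<
                  decltype(std::declval<latin1string>().substr(1, 5)),
                  latin1string>::value,
              "substr keeps the tag");
static_assert(!std::is_convertible<latin1string, latin2string>::value,
              "no implicit cross-charset conversion");
static_assert(!std::is_convertible<std::string, latin2string>::value,
              "untagged strings need an explicit constructor call");

inline void substr_usage() {
    latin1string s1("hello world");
    latin1string s2 = s1.substr(1, 5);
    assert(s2.str() == "ello ");
    // latin2string s3 = s1.substr(1, 5);  // would not compile
}
```

The cost of this approach is that every std::string member that should be available has to be forwarded by hand.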
> >
> > For variable-width character sets, the methods of std::string are less
> > useful (though far from useless). I understand that there's already a
> > utf8 iterator somewhere in Boost; can it help?
> >
> > For run-time character sets, is there any way to provide e.g. run-time
> > iterators?
> >
> > I imagine these strings being used as follows:
> > - Input to the program is either run-time or compile-time tagged with
> > any character set.
> > - Data that is not manipulated in any way is just passed through.
> > - Data that will be processed will first be converted to a suitable,
> > compile-time-tagged, character set, and if appropriate converted back
> > afterwards.
> >
> > So the absence of (useful) string operations on run-time-tagged or
> > variable-width character set data is not a problem.
> >
> > For conversions, there is the question of partial characters in
> > variable-width character sets. If a program is processing data in
> > chunks it may be legitimate for a chunk boundary to fall in the middle
> > of a UTF8 character. IIRC, iconv has a method to deal with this which
> > we could expose in a stateful converter:
> >
> > charset_converter utf8_to_ucs4(utf8,ucs4);
> > while (!eof) {
> >   utf8string s = get_chunk();
> >   ucs4string t = utf8_to_ucs4(s);
> >   send_chunk(t);
> > }
> > utf8_to_ucs4.flush();
> >
> > - but many applications may only need a stateless converter.
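To illustrate the chunk-boundary issue, here is a minimal stateful UTF-8 to UCS-4 decoder, assuming C++11 or later. It is hand-rolled rather than built on iconv, the class name is hypothetical, and validation is deliberately thin (overlong forms and surrogates are not rejected). The point is only that bytes of an incomplete trailing character are held back until the next chunk arrives.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

class utf8_to_ucs4_converter {
public:
    // Converts one chunk; an incomplete character at the end is kept
    // in pending_ and completed by the next call.
    std::u32string operator()(const std::string& chunk) {
        std::string in = pending_ + chunk;
        pending_.clear();
        std::u32string out;
        std::size_t i = 0;
        while (i < in.size()) {
            unsigned char lead = static_cast<unsigned char>(in[i]);
            std::size_t len = sequence_length(lead);
            if (i + len > in.size()) {  // character continues in next chunk
                pending_ = in.substr(i);
                break;
            }
            // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
            char32_t cp = (len == 1)
                ? static_cast<char32_t>(lead)
                : static_cast<char32_t>(lead & (0xFF >> (len + 1)));
            for (std::size_t k = 1; k < len; ++k) {
                unsigned char c = static_cast<unsigned char>(in[i + k]);
                if ((c & 0xC0) != 0x80)
                    throw std::runtime_error("bad continuation byte");
                cp = (cp << 6) | (c & 0x3F);
            }
            out.push_back(cp);
            i += len;
        }
        return out;
    }

    // Call at end of input: leftover bytes mean the input was truncated.
    void flush() {
        if (!pending_.empty()) {
            pending_.clear();
            throw std::runtime_error("truncated UTF-8 input");
        }
    }

private:
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        throw std::runtime_error("bad lead byte");
    }

    std::string pending_;  // bytes of an incomplete trailing character
};

// Usage: U+00E9 is the two bytes C3 A9, split here across two chunks.
inline void chunked_usage() {
    utf8_to_ucs4_converter conv;
    std::u32string a = conv("caf\xC3");
    std::u32string b = conv("\xA9!");
    conv.flush();
    assert(a == U"caf");
    assert(b.size() == 2 && b[0] == 0xE9 && b[1] == U'!');
}
```

A stateless converter would have to either reject the first chunk or emit a replacement character for the dangling byte.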
> >
> >
> > I will be working on this over the next couple of weeks, so any
> > feedback would be much appreciated.
> >
> > Regards,
> >
> > Phil.
> >