From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-09-23 19:37:32


Dear All,

Something that I have been thinking about for a while is storing
strings tagged with their character set. Since I now have a practical
need for this I plan to try to implement something. Your feedback
would be appreciated.

The starting point is the idea that the character set of a string may
be known at compile time or at run time, and so two types of tagging
are possible. First compile-time tagging:

template <character_set>
class tagged_string { ... };

tagged_string<utf8> s1;
tagged_string<latin1> s2;

Some typedefs would be appropriate:

typedef tagged_string<utf8> utf8string;
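A minimal sketch of the compile-time variant, assuming that 'character_set' is represented by empty tag types and that the string holds its bytes by composition. Both choices, and the member names bytes() and size_in_bytes(), are assumptions made for illustration and are not settled by this proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical tag types standing in for 'character_set'.
// Users could extend the set simply by declaring another empty struct.
struct utf8 {};
struct latin1 {};

// A string whose character set is part of its type.
template <typename CharSet>
class tagged_string {
public:
    tagged_string() {}
    explicit tagged_string(const std::string& bytes) : bytes_(bytes) {}

    const std::string& bytes() const { return bytes_; }
    std::size_t size_in_bytes() const { return bytes_.size(); }

private:
    std::string bytes_;  // raw encoded bytes, interpreted according to CharSet
};

typedef tagged_string<utf8> utf8string;
typedef tagged_string<latin1> latin1string;

// Differently tagged strings are unrelated types, so mixing them is a compile error.
static_assert(!std::is_convertible<latin1string, utf8string>::value,
              "no implicit conversion between character sets");
```

Because the tag is only a template argument, it costs nothing at run time; the price is that the character set must be known when the code is compiled.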

Now run-time tagging:

class rt_tagged_string {
private:
   character_set cs;
public:
   rt_tagged_string(character_set cs_): cs(cs_) ...
   ...
};

rt_tagged_string s3(utf8);
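The run-time variant could be filled in along these lines. This is only a sketch: the plain enum, its cs_ prefixed values, and the helper same_charset() are hypothetical, and a fixed enum is not user-extensible in the way the proposal would eventually need.

```cpp
#include <cassert>
#include <string>

// Hypothetical run-time identifiers for character sets.
enum character_set { cs_utf8, cs_latin1, cs_latin2 };

class rt_tagged_string {
public:
    explicit rt_tagged_string(character_set cs) : cs_(cs) {}
    rt_tagged_string(character_set cs, const std::string& bytes)
        : cs_(cs), bytes_(bytes) {}

    character_set charset() const { return cs_; }
    const std::string& bytes() const { return bytes_; }

private:
    character_set cs_;   // the tag travels with the data at run time
    std::string bytes_;  // raw encoded bytes
};

// Byte-wise operations across two strings only make sense when the tags agree.
inline bool same_charset(const rt_tagged_string& a, const rt_tagged_string& b) {
    return a.charset() == b.charset();
}
```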

(Concise-yet-clear names for any of these classes would be great.)

I propose to implement conversion between the strings using iconv
and/or GNU recode. It would be easy to allow this conversion to happen
invisibly, but it might be wiser to make conversion explicit.
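To illustrate what "explicit" conversion might look like at the call site, here is a sketch with a named free function. To stay self-contained it hand-codes the Latin-1 to UTF-8 mapping instead of calling a conversion library; the function name to_utf8 and the bare struct used for tagged_string are assumptions for this example only.

```cpp
#include <cassert>
#include <string>

struct utf8 {};
struct latin1 {};

// Deliberately minimal stand-in for a compile-time-tagged string.
template <typename CharSet>
struct tagged_string {
    std::string bytes;
};

// Explicit conversion: the caller has to name it, so it never happens invisibly.
// Every Latin-1 byte is the Unicode code point of the same value, so bytes
// below 0x80 are copied and the rest become two-byte UTF-8 sequences.
inline tagged_string<utf8> to_utf8(const tagged_string<latin1>& in) {
    tagged_string<utf8> out;
    for (std::string::size_type i = 0; i < in.bytes.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in.bytes[i]);
        if (c < 0x80) {
            out.bytes += static_cast<char>(c);
        } else {
            out.bytes += static_cast<char>(0xC0 | (c >> 6));
            out.bytes += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With no converting constructor between the two types, an accidental assignment from a latin1 string to a utf8 string fails to compile, and the only way across is the named call.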

I'm not sure what the 'character_set' that I've used above should be.
It needs to be some sort of user-extensible enum or type-tag.

We need character types of 8, 16 and 32 bits. wchar_t is not useful here
because whether it is 16 or 32 bits is implementation-defined. So I
propose the following, modelled after cstdint:

typedef char char8_t;
typedef <implementation-defined> char16_t;
typedef <implementation-defined> char32_t;

I then propose a character_set_traits class:

template <character_set>
class character_set_traits;

template <>
class character_set_traits<utf8> {
public:
   typedef char8_t char_t;
   static const bool variable_width = true;
   ...
};
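A compilable sketch of such a traits class, assuming tag types for the character sets. It uses plain char and the built-in char32_t (C++11) for the code unit types, because char8_t has since become a keyword (C++20) and can no longer be introduced with a typedef. The helper storage_for is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical tag types for the character sets.
struct utf8 {};
struct latin1 {};
struct ucs4 {};

// Primary template is left undefined: an unknown character set fails to compile.
template <typename CharSet>
struct character_set_traits;

template <>
struct character_set_traits<utf8> {
    typedef char char_t;
    static const bool variable_width = true;
};

template <>
struct character_set_traits<latin1> {
    typedef char char_t;
    static const bool variable_width = false;
};

template <>
struct character_set_traits<ucs4> {
    typedef char32_t char_t;
    static const bool variable_width = false;
};

// Picks the std::basic_string specialisation matching a character set's code unit.
template <typename CharSet>
struct storage_for {
    typedef std::basic_string<typename character_set_traits<CharSet>::char_t> type;
};
```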

For the fixed-width, compile-time-tagged strings I think it makes sense
to inherit from
std::basic_string<character_set_traits<charset>::char_t>.
The only problem I can see with this is that

latin1string s1 = "hello world";
s1.substr(1,5);   // <--- this returns a std::string, not a latin1string

If latin1string has a constructor from std::string (which is its own
base type) that's fine, i.e. we can still write:

latin1string s2 = s1.substr(1,5);

but unfortunately we can also write

latin2string s3 = s1.substr(1,5);

which is not so good.
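One possible way around this, sketched here as an assumption and not as the approach this proposal settles on, is to hold the std::string by composition, make the constructor from std::string explicit, and re-declare substr so that it returns the tagged type. The accessor str() is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

struct latin1 {};
struct latin2 {};

template <typename CharSet>
class tagged_string {
public:
    tagged_string() {}
    // explicit: a bare std::string cannot silently acquire a character set
    explicit tagged_string(const std::string& s) : s_(s) {}

    // Returns the same tagged type, so the character set survives the call.
    tagged_string substr(std::size_t pos,
                         std::size_t n = std::string::npos) const {
        return tagged_string(s_.substr(pos, n));
    }

    const std::string& str() const { return s_; }

private:
    std::string s_;
};

typedef tagged_string<latin1> latin1string;
typedef tagged_string<latin2> latin2string;

// "latin2string s3 = s1.substr(1,5);" is now rejected at compile time.
static_assert(!std::is_convertible<latin1string, latin2string>::value,
              "substr result cannot be assigned to a different character set");
```

The cost is that every std::string member worth keeping has to be forwarded by hand, which is exactly the work that inheritance was meant to avoid.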

So a different approach is to define a set of character-set-specific
character types, and build string types from them:

typedef char8_t latin1char;
typedef char8_t latin2char;
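Note that two typedefs of the same underlying type are aliases for one type, so string types built from them would be identical too. The sketch below checks this and shows one hypothetical alternative, a one-byte wrapper struct templated on a tag (the names tagged_char and the _alias typedefs are made up for this example). Building a std::basic_string from such a type would additionally require a matching char_traits specialisation, which is not shown.

```cpp
#include <cassert>
#include <type_traits>

struct latin1 {};
struct latin2 {};

// Plain typedefs: both names refer to the very same type.
typedef char latin1char_alias;
typedef char latin2char_alias;
static_assert(std::is_same<latin1char_alias, latin2char_alias>::value,
              "typedefs do not create distinct types");

// Hypothetical alternative: a wrapper whose type depends on the tag.
template <typename CharSet>
struct tagged_char {
    unsigned char value;
};

typedef tagged_char<latin1> latin1char;
typedef tagged_char<latin2> latin2char;

static_assert(!std::is_same<latin1char, latin2char>::value,
              "wrapper types are distinct per character set");
static_assert(sizeof(latin1char) == 1, "still a single byte of storage");
```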

For variable-width character sets, the methods of std::string are less
useful (though far from useless). I understand that there's already a
UTF-8 iterator somewhere in Boost; can it help?

For run-time character sets, is there any way to provide e.g. run-time iterators?

I imagine these strings being used as follows:
- Input to the program is either run-time or compile-time tagged with
any character set.
- Data that is not manipulated in any way is just passed through.
- Data that will be processed will first be converted to a suitable,
compile-time-tagged, character set, and if appropriate converted back afterwards.

So the absence of (useful) string operations on run-time-tagged or
variable-width character set data is not a problem.

For conversions, there is the question of partial characters in
variable-width character sets. If a program is processing data in
chunks it may be legitimate for a chunk boundary to fall in the middle
of a UTF-8 character. IIRC, iconv has a method to deal with this which
we could expose in a stateful converter:

charset_converter utf8_to_ucs4(utf8,ucs4);
while (!eof) {
   utf8string s = get_chunk();
   ucs4string t = utf8_to_ucs4(s);
   send_chunk(t);
}
utf8_to_ucs4.flush();

- but many applications may only need a stateless converter.
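To make the chunk-boundary problem concrete, here is a sketch of a stateful UTF-8 to UCS-4 converter that remembers a half-finished character between calls. It is hand-written for illustration and does not wrap any conversion library; the class name utf8_to_ucs4_converter is hypothetical, and the decoder does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

class utf8_to_ucs4_converter {
public:
    utf8_to_ucs4_converter() : pending_(0), code_point_(0) {}

    // Decodes one chunk. An incomplete trailing sequence is kept as state
    // and completed by the first bytes of the next chunk.
    std::u32string operator()(const std::string& chunk) {
        std::u32string out;
        for (std::string::size_type i = 0; i < chunk.size(); ++i) {
            unsigned char b = static_cast<unsigned char>(chunk[i]);
            if (pending_ > 0) {
                if ((b & 0xC0) != 0x80)
                    throw std::runtime_error("bad continuation byte");
                code_point_ = (code_point_ << 6) | (b & 0x3F);
                if (--pending_ == 0) out += code_point_;
            } else if (b < 0x80) {
                out += static_cast<char32_t>(b);
            } else if ((b & 0xE0) == 0xC0) {
                pending_ = 1; code_point_ = b & 0x1F;
            } else if ((b & 0xF0) == 0xE0) {
                pending_ = 2; code_point_ = b & 0x0F;
            } else if ((b & 0xF8) == 0xF0) {
                pending_ = 3; code_point_ = b & 0x07;
            } else {
                throw std::runtime_error("bad lead byte");
            }
        }
        return out;
    }

    // At end of input no character may be left half-finished.
    void flush() {
        if (pending_ != 0)
            throw std::runtime_error("truncated UTF-8 sequence");
    }

private:
    int pending_;          // continuation bytes still expected
    char32_t code_point_;  // bits accumulated so far
};
```

A stateless converter would have to treat the split character as an error in both chunks, which is why the distinction matters for chunked input.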

I will be working on this over the next couple of weeks, so any
feedback would be much appreciated.

Regards,

Phil.

