Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-11-19 14:43:08


James Porter wrote:
> Over the past few months, I've been tinkering with a Unicode string
> library. It's still *far* from finished, but it's far enough along that
> the overall structure is visible. I've seen a bunch of Unicode proposals
> for Boost come and go, so hopefully this one will address the most
> common needs people have.

Hi Jim,

Mine was probably one of those proposals that you looked at; for the
record the code is all available at

   http://svn.chezphil.org/libpbe/trunk/include/charset/

and nearby directories. I was reasonably happy with my implementations
of the most common character sets (i.e. Unicode, ASCII, ISO 8859), but I
wanted to explore some of the more esoteric ones to understand the
implications that they would have on how a general-purpose framework
should work. For example, I wanted to explore how error handling
policies could be specified and what conditions they would need to
handle. The last work that I did with this code was a general-purpose
command-line conversion utility that could be used to benchmark the
conversions. Input and output character sets and error policies could
be set from the command-line, but the problem that I hit was that
making these things template parameters led to a code-size and
compilation-time explosion. That means that I'll need to rethink a few
things, but it has been low on my to-do list.

> The library is based on two (immutable) string types: ct_string and
> rt_string. ct_strings are _C_ompile _T_ime tagged with a particular
> encoding, and rt_strings are _R_un _T_ime tagged with an encoding.

Mutable vs. immutable strings is something that has been briefly
discussed before. My personal preference has been for mutable strings,
but without the O(1) random access guarantee of a std::string. I also
considered strings where the only mutation allowed is appending, i.e.
there's a back_insert_iterator. Why do you prefer immutable strings?

One argument for mutable strings is simply that std::string is mutable,
and that a proposal is more likely to prove popular if it changes less
w.r.t. existing practice.

I also have run-time and compile-time tagging. My feeling now is that
compile-time-tagging is the more important case. Data whose encoding
is known only at run-time can be handled using a more ad-hoc method if
necessary. I also struggled to find good names for these things; I
don't find ct_string and rt_string great. Do any readers have suggestions?
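
To make the distinction concrete, here is a minimal sketch of the two kinds of tagging. Everything in it is an assumption for illustration (the namespace, the tag structs, the encoding_id enum, and storing the tag inside rt_string); it is not taken from Jim's code or from mine.

```cpp
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical namespace, for illustration only

// Encoding tags: plain types, so the encoding is part of the string's type.
struct utf8  { static const char* name() { return "UTF-8"; } };
struct utf16 { static const char* name() { return "UTF-16"; } };

// Compile-time tagged: ct_string<utf8> and ct_string<utf16> are distinct
// types, so mixing them up is a compile error rather than a run-time bug.
template <typename Encoding>
class ct_string {
public:
    explicit ct_string(std::string bytes) : bytes_(std::move(bytes)) {}
    static const char* encoding_name() { return Encoding::name(); }
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

// Run-time tagged: one type; the encoding is a value carried with the data.
// (Storing the tag in the string is an assumption of this sketch.)
enum class encoding_id { utf8, utf16 };

class rt_string {
public:
    rt_string(std::string bytes, encoding_id enc)
        : bytes_(std::move(bytes)), enc_(enc) {}
    encoding_id encoding() const { return enc_; }
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
    encoding_id enc_;
};

}  // namespace sketch
```

With the compile-time form a conversion can be selected by overload resolution; with the run-time form it has to be selected by a switch or table lookup on encoding().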

> This is to allow for faster conversion when the encoding is known at
> compile-time, but to allow for conversion at run-time (useful for
> reading XML!).
>
> General usage would look something like this:
>
> ct_string<ct::utf8> foo("Hello, world!");

typedef ct_string<ct::utf8> utf8string;

> ct_string<ct::utf16> bar;
> bar.encode(foo);

Well, it's actually decoding the UTF-8 and encoding the UTF-16. Maybe
"transcode", and preferably as a free function:

transcode(bar,foo);

equivalent to:

std::copy(foo.begin(), foo.end(), std::back_inserter(bar));
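
As a rough sketch of what such a free function might look like for the UTF-8 to UTF-16 case, using std::string and std::u16string as stand-ins for the tagged string types: the helper names decode_utf8 and encode_utf16 are hypothetical, and the input is assumed to be well-formed UTF-8 (no validation, no error policy).

```cpp
#include <iterator>
#include <string>

// Hypothetical helper: read one code point from well-formed UTF-8 and
// advance the iterator past it. Malformed or truncated input is NOT handled.
template <typename In>
char32_t decode_utf8(In& it) {
    unsigned char b0 = static_cast<unsigned char>(*it++);
    if (b0 < 0x80) return b0;                       // single-byte (ASCII)
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);             // payload bits of lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    return cp;
}

// Hypothetical helper: write one code point as UTF-16, using a surrogate
// pair for code points above the Basic Multilingual Plane.
template <typename Out>
Out encode_utf16(char32_t cp, Out out) {
    if (cp < 0x10000) {
        *out++ = static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        *out++ = static_cast<char16_t>(0xD800 + (cp >> 10));
        *out++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    return out;
}

// Destination first, source second, matching transcode(bar, foo) above.
// Appends to dest rather than replacing its contents.
inline void transcode(std::u16string& dest, const std::string& src) {
    auto out = std::back_inserter(dest);
    for (auto it = src.begin(); it != src.end(); )
        out = encode_utf16(decode_utf8(it), out);
}
```

The loop is the std::copy idea with a decode step on the input side and an encode step on the output side; with code-point iterators on both string types it would reduce to a plain std::copy.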

> rt_string baz;
> baz.encode(bar,rt::utf8);

So the encoding of the rt_string is not stored in the string?

> Note the use of ct::utf8 and rt::utf8. As you might expect from the
> syntax, ct::utf8 is a type, and rt::utf8 is an object. Broadly speaking,
> to create an encoding, you create a class with read and write methods,
> and then you create an instance of an rt_encoding<MyEncoding>. Most of
> this is laid out in the comments of my code, so I won't go into too much
> detail here.

I'll try to find time to have a look, but I do encourage you to post
more details to the list. That tends to generate more discussion than
"please look at the code" proposals do.

> There's still a lot missing from the code (most notably,
> dynamically-sized strings and string concatenation),

So what is your underlying implementation? Not std::string?

> but here's a
> rundown of what *is* present:
>
> * Compile-time and run-time tagged strings
> * Re-encoding of strings based on compile-/run-time tags
> * Uses simple memory copying when source and dest encodings are the same
> * Forward iterators to step through code points in strings
>
> If you'd like to take a look at the code, it's available here:
> http://www.teamboxel.com/misc/unicode.tar.gz . I've tested it in gcc
> 4.3.2 and MSVC8, but most modern compilers should be able to handle it.
> Comments and criticisms are, of course, welcome.

One of my priorities has been performance; it would be good to compare
e.g. utf8-to/from-utf16 conversion speed.

My feeling about the way forward is as follows:

- A complete character set library is a lot of work.

- A library that only understands Unicode is less work, but is it what
people need?

- Is there a consensus about mutable vs. immutable strings? Perhaps we
should start by defining a new string concept, removing the
character-set-unfriendly aspects of std::string like indexing using
integers, and see what people think of it. I have been trying to use
only standard algorithms and iterators with strings in new code, but it can
often be simpler to use indexes and the std::string members that use or
return them.

- It would be useful to factor out the actual Unicode bit-bashing
operations. I have implementations of them that I have carefully
tuned, and they are ready for wider use even though the rest of my code isn't.
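
To show the kind of operation meant by "bit-bashing", here is an untuned sketch of a UTF-8 encoder written against an output iterator, so that it is independent of any string class. It is not the tuned code referred to above; the names utf8_length and encode_utf8 are hypothetical, and the code point is assumed to be a valid scalar value no greater than 0x10FFFF.

```cpp
#include <iterator>
#include <string>

// Number of bytes needed to encode cp in UTF-8 (cp assumed <= 0x10FFFF).
inline int utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Write cp as UTF-8 to any output iterator that accepts char.
template <typename Out>
Out encode_utf8(char32_t cp, Out out) {
    const int n = utf8_length(cp);
    if (n == 1) {
        *out++ = static_cast<char>(cp);
        return out;
    }
    // Lead-byte prefixes for 2-, 3- and 4-byte sequences (indices 0-1 unused).
    static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
    *out++ = static_cast<char>(lead[n] | (cp >> (6 * (n - 1))));
    // Continuation bytes: 10xxxxxx, most significant six bits first.
    for (int i = n - 2; i >= 0; --i)
        *out++ = static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
    return out;
}
```

Because it only needs an output iterator, the same function can fill a std::string through std::back_inserter, write into a fixed buffer, or sit underneath whichever string concept is eventually agreed on.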

Regards, Phil.

