From: Erik Wien (wien_at_[hidden])
Date: 2005-03-15 22:03:26


Hi. I hope you guys still remember this, despite my lack of activity on
this list: I was here a while back about developing a Unicode library
for Boost, and well... we're off! :) We are now at a point in the
development where we would really appreciate feedback from you Boosters.

We have developed a *very* early prototype based on some of the ideas
put forward in the earlier discussion here, and we would like you guys
to comment on it. The design is by no means locked, and represents only
one possible way to implement a Unicode string class. If you have an
alternate solution, please let us know. We are using an evolutionary
development model (prototyping), so we are open to design changes
should there be a need for that. You can download the source code
(VC++/DevCpp projects only for now, sorry) from our site here:
http://hovedprosjekter.hig.no/v2005/data/Gr5_unicode (the files
section). On the same site we have also set up a forum where you can
discuss different aspects of the project if you want to keep it off
list. No registration is required, so it should be hassle (password) free.

Current design:
The current design is based around the concept of «encoding_traits».
These are templated on the different encodings used in Unicode (UTF-8,
UTF-16 and UTF-32, the latter two in both byte orders), and provide
functions and typedefs for working on code units (8, 16 and 32 bit
integers respectively) in any encoding. These traits are then used for
implementing different interfaces that externally use 32-bit code
points, thereby abstracting away the underlying encoding.
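
To make the traits idea concrete, here is a minimal sketch of what such
traits could look like. It is not the prototype's code: the tag types,
the encode/decode signatures and the two helper functions are names
assumed for illustration only, UTF-16 is handled in native byte order,
and the input is assumed to be well formed (no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint32_t code_point;   // 32-bit code point seen by callers

struct utf8_tag {};    // hypothetical tag types, one per encoding
struct utf16_tag {};   // native byte order only, for brevity

template<typename encoding> struct encoding_traits;   // specialized below

template<>
struct encoding_traits<utf8_tag>
    {
    typedef std::uint8_t code_unit;

    // Append the code units for cp; assumes cp is a valid scalar value.
    static void encode(code_point cp, std::vector<code_unit>& out)
        {
        if (cp < 0x80)
            out.push_back(static_cast<code_unit>(cp));
        else if (cp < 0x800)
            {
            out.push_back(static_cast<code_unit>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
            }
        else if (cp < 0x10000)
            {
            out.push_back(static_cast<code_unit>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
            }
        else
            {
            out.push_back(static_cast<code_unit>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<code_unit>(0x80 | (cp & 0x3F)));
            }
        }

    // Decode the code point starting at units[pos] and advance pos past it.
    static code_point decode(const std::vector<code_unit>& units, std::size_t& pos)
        {
        code_unit lead = units[pos++];
        if (lead < 0x80)
            return lead;
        int trailing = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
        code_point cp = lead & (0x3F >> trailing);   // payload bits of the lead unit
        for (int i = 0; i < trailing; ++i)
            cp = (cp << 6) | (units[pos++] & 0x3F);
        return cp;
        }
    };

template<>
struct encoding_traits<utf16_tag>
    {
    typedef std::uint16_t code_unit;

    static void encode(code_point cp, std::vector<code_unit>& out)
        {
        if (cp < 0x10000)
            out.push_back(static_cast<code_unit>(cp));
        else
            {
            cp -= 0x10000;   // split into a surrogate pair
            out.push_back(static_cast<code_unit>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<code_unit>(0xDC00 | (cp & 0x3FF)));
            }
        }

    static code_point decode(const std::vector<code_unit>& units, std::size_t& pos)
        {
        code_point lead = units[pos++];
        if (lead < 0xD800 || lead > 0xDBFF)
            return lead;
        code_point trail = units[pos++];
        return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        }
    };

// Interfaces written against the traits only ever see code points.
template<typename encoding>
std::vector<typename encoding_traits<encoding>::code_unit>
from_code_points(const std::vector<code_point>& cps)
    {
    std::vector<typename encoding_traits<encoding>::code_unit> units;
    for (std::size_t i = 0; i < cps.size(); ++i)
        encoding_traits<encoding>::encode(cps[i], units);
    return units;
    }

template<typename encoding>
std::vector<code_point>
to_code_points(const std::vector<typename encoding_traits<encoding>::code_unit>& units)
    {
    std::vector<code_point> cps;
    std::size_t pos = 0;
    while (pos < units.size())
        cps.push_back(encoding_traits<encoding>::decode(units, pos));
    return cps;
    }
```

With something like this, from_code_points<utf8_tag>(cps) and
from_code_points<utf16_tag>(cps) produce different code unit sequences
from the same code points, and to_code_points recovers the original
code points from either one.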

The string class itself is created with encoding transparency in mind,
also at the class level. This means that the encoding used in the string
is not a template parameter of the string class itself (which would make
each instantiation of the string its own type), but rather a parameter
of an implementation class that is used internally to hold the string.
Something like this (highly simplified):

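class impl_base
     {
     public:
         virtual ~impl_base() {}
         // A lot of pure virtual functions for manipulating a string.
     };

template<typename encoding>
class impl : public impl_base
     {
     // Implement the functions...obviously.
     };

class encoded_string
     {
     impl_base* m_impl;   // owned; null until an encoding is set

     public:
         encoded_string() : m_impl(0) {}
         ~encoded_string() { delete m_impl; }   // copying omitted here

         template<typename encoding>
         void set_encoding(encoding enc_tag)
             {
             delete m_impl;   // drop the previous implementation, if any
             m_impl = new impl<encoding>();
             }
     };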

The reason for doing this is that it allows functions that take
encoded_string parameters to be blissfully unaware of what encoding they
are working on, without having to templatize (is that a word?) the
function itself. (Something I understood was a bit of a worry for some
in the last discussion.) An alternate way of doing this (something we
also tested when developing the current version) is to simply template
the string class itself on encoding, but then you lose the above
advantage of being able to have non-template functions working on
multiple encodings. You do however gain speed (I would assume), since you
wouldn't have the overhead of virtual function calls, as well as a less
complex implementation.
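
The trade-off can be shown side by side in a small sketch. Everything
here is assumed for illustration and is not taken from the prototype:
the tags carry a starts_code_point() helper, the strings only know how
to report their length in code points, the constructor taking a tag
stands in for set_encoding, and basic_encoded_string and fits() are
made-up names. Input is assumed to be well-formed UTF-8 or UTF-32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical encoding tags: each names its code unit type and can tell
// whether a code unit begins a new code point.
struct utf8_tag
    {
    typedef std::uint8_t code_unit;
    static bool starts_code_point(code_unit u) { return (u & 0xC0) != 0x80; }
    };

struct utf32_tag
    {
    typedef std::uint32_t code_unit;
    static bool starts_code_point(code_unit) { return true; }
    };

template<typename encoding>
std::size_t count_code_points(const std::vector<typename encoding::code_unit>& units)
    {
    std::size_t n = 0;
    for (std::size_t i = 0; i < units.size(); ++i)
        if (encoding::starts_code_point(units[i]))
            ++n;
    return n;
    }

// Design A: the encoding is hidden behind a virtual interface.
class impl_base
    {
    public:
        virtual ~impl_base() {}
        virtual std::size_t length() const = 0;   // in code points
    };

template<typename encoding>
class impl : public impl_base
    {
    public:
        explicit impl(const std::vector<typename encoding::code_unit>& units)
            : m_units(units) {}
        std::size_t length() const override
            { return count_code_points<encoding>(m_units); }
    private:
        std::vector<typename encoding::code_unit> m_units;
    };

class encoded_string
    {
    public:
        template<typename encoding>
        encoded_string(encoding, const std::vector<typename encoding::code_unit>& units)
            : m_impl(new impl<encoding>(units)) {}
        std::size_t length() const { return m_impl->length(); }   // virtual call
    private:
        std::unique_ptr<impl_base> m_impl;
    };

// Design B: the encoding is part of the string's type.
template<typename encoding>
class basic_encoded_string
    {
    public:
        explicit basic_encoded_string(const std::vector<typename encoding::code_unit>& units)
            : m_units(units) {}
        std::size_t length() const   // resolved at compile time
            { return count_code_points<encoding>(m_units); }
    private:
        std::vector<typename encoding::code_unit> m_units;
    };

// With design A, one ordinary function serves every encoding.
bool fits(const encoded_string& s, std::size_t max_code_points)
    {
    return s.length() <= max_code_points;
    }

// With design B, the same function has to be a template.
template<typename encoding>
bool fits(const basic_encoded_string<encoding>& s, std::size_t max_code_points)
    {
    return s.length() <= max_code_points;
    }
```

In this sketch the non-template fits() accepts an encoded_string built
from UTF-8 or from UTF-32 code units alike, at the cost of a virtual
call and a heap allocation per string, while the templated overload is
instantiated once per encoding it is used with.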

There's also an implementation of the Unicode Character Database in the
prototype, along with an implementation of the normalization algorithms,
but I won't go into the details of them here (to keep this from becoming
a novel). The code should be easy enough to follow if you want to look.

Anyway, comments are, as always, welcome, either here or in the forum at
the site.

Regards
- Erik

To Eric Niebler: Did you receive the mail I sent you a while back about
the whole contact-person debacle? (on the Boost Consulting address)
I never got a reply, so I'm not sure if it went through.

