From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-03-19 13:13:59


On Sat, 19 Mar 2005 08:57:55 -0800, Jonathan Biggar <jon_at_[hidden]> wrote:
> Rogier van Dalen wrote:
> >>Be careful with making a global assertion. Different users of a Unicode
> >>library will need to access the data at different levels. Some will
> >>need the raw encoding bytes or words, some will need code points, and
> >>some will need 'grapheme clusters'.
> >>
> >>The library should support working at the level that each particular
> >>user needs, and different parts of an application or library may need to
> >>work at multiple levels.
> >
> > A decision must be made. Certainly you should have access to code
> > points; and you should be able to work at multiple levels. However,
> > one level has to be the default level. Most programmers should be able
> > to get what they want by using boost::unicode_string (or whatever it's
> > going to be called). We need to make a "global assertion" that's
> > correct 99% of the time.
>
> I don't see why there has to be a "default" interface at all. There
> should just be multiple interfaces, [...]

I'm sorry, I don't see how these propositions are mutually exclusive.

I believe we are talking about different kinds of users. Let's get
this clear: I was assuming that the Unicode library will be aimed at
programmers doing everyday programming jobs whose programs will have
to deal with non-English characters (because they're bound to be
localised, or because non-English names will be inserted in a
database, or whatever), i.e. people who have no idea about how Unicode
works and don't want to, as long as it does work.
Correct me if I'm wrong, but you seem to assume the library will be
used mostly by those who need to code things like codeset conversions,
who should know a great deal about Unicode.

What I think would be a good interface:

// A string of code points, encoded UTF-16 (or templated).
class code_point_string {
public:
    //...
    const std::basic_string<char16_t> & code_units() const;
};

// A string of "grapheme clusters", with a code_point_string underlying.
// The string is always in a normalisation form.
template <class NormalisationPolicy = NormalisationFormC>
class unicode_string {
public:
    //...
    const code_point_string & code_points() const;
};

Those who need to process code points can happily use
code_point_string; others can use unicode_string.
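
To make the layering concrete, here is a minimal sketch of how the two levels might sit on top of each other. It is only an illustration under stated assumptions, not a proposed implementation: the size() members are hypothetical, normalisation is left out entirely, and the grapheme cluster rule is a crude stand-in (only U+0300..U+036F are treated as combining) rather than the real Unicode segmentation rules.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sketch only; none of this is an existing Boost API.

// Code point level: stores UTF-16 code units, counts code points.
class code_point_string {
public:
    explicit code_point_string(std::u16string units) : units_(std::move(units)) {}

    // Raw encoding level, for those who need it.
    const std::u16string& code_units() const { return units_; }

    // Number of code points, assuming well-formed UTF-16: every code unit
    // except a low (trailing) surrogate starts a new code point.
    std::size_t size() const {
        std::size_t n = 0;
        for (char16_t u : units_)
            if (u < 0xDC00 || u > 0xDFFF) ++n;
        return n;
    }

private:
    std::u16string units_;
};

// "Character" level: counts grapheme clusters on top of a code_point_string.
class unicode_string {
public:
    explicit unicode_string(std::u16string units) : points_(std::move(units)) {}

    // One level down, for those who need code points.
    const code_point_string& code_points() const { return points_; }

    // Crude approximation (an assumption of this sketch): a code point in
    // U+0300..U+036F attaches to the preceding cluster; everything else
    // starts a new cluster.
    std::size_t size() const {
        std::size_t n = 0;
        for (char16_t u : points_.code_units()) {
            const bool low_surrogate = u >= 0xDC00 && u <= 0xDFFF;
            const bool combining = u >= 0x0300 && u <= 0x036F;
            if (!low_surrogate && !combining) ++n;
        }
        return n;
    }

private:
    code_point_string points_;
};
```

With the string U+0065 U+0301 U+1D11E (an "e" with a combining acute accent, followed by a character outside the BMP), the three levels report three different lengths: 4 code units, 3 code points and 2 grapheme clusters. The everyday programmer only ever sees the last number unless he asks for more.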

> [...] Other programs may have a need to
> distinguish between the two, and need the ability to convert a Unicode
> string between the form where all combining characters are combined and
> the form where they are all separate explicit codepoints.

I believe you would not need to manipulate code points to convert
*all* characters in a string from one normalisation form to another.
(See the interface proposal above.)

> A way of telling
> the library that you don't care about the difference is to ensure that
> every string you use is canonicalized into the form that makes your job
> easier.

I'd say the normalisation form of a string is an invariant that the
library rather than the user should deal with.
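
A rough sketch of what I mean by the library maintaining the invariant, assuming a policy with a static apply() function. The policies toy_form_c and toy_form_d are hypothetical and deliberately tiny: they know only the pair U+0065 U+0301 <-> U+00E9, whereas real normalisation needs the Unicode data tables. For brevity this version stores code units directly instead of wrapping a code_point_string.

```cpp
#include <string>

// Toy policies (hypothetical): they handle a single composition pair only.
struct toy_form_c {
    static std::u16string apply(const std::u16string& in) {
        std::u16string out;
        for (char16_t u : in) {
            if (u == 0x0301 && !out.empty() && out.back() == u'e')
                out.back() = 0x00E9;      // compose e + combining acute
            else
                out.push_back(u);
        }
        return out;
    }
};

struct toy_form_d {
    static std::u16string apply(const std::u16string& in) {
        std::u16string out;
        for (char16_t u : in) {
            if (u == 0x00E9) {            // decompose e-acute
                out.push_back(u'e');
                out.push_back(0x0301);
            } else {
                out.push_back(u);
            }
        }
        return out;
    }
};

template <class NormalisationPolicy = toy_form_c>
class unicode_string {
public:
    // Every constructor normalises, so the invariant holds from the start
    // and the user never has to establish it by hand.
    explicit unicode_string(const std::u16string& units)
        : units_(NormalisationPolicy::apply(units)) {}

    // Changing normalisation form is a construction from another
    // unicode_string, not a manual edit of code points.
    template <class OtherPolicy>
    explicit unicode_string(const unicode_string<OtherPolicy>& other)
        : units_(NormalisationPolicy::apply(other.code_units())) {}

    const std::u16string& code_units() const { return units_; }

    // Because both sides are in the same form, comparing code units
    // compares canonically equivalent strings as equal.
    friend bool operator==(const unicode_string& a, const unicode_string& b) {
        return a.units_ == b.units_;
    }

private:
    std::u16string units_;
};
```

Here a unicode_string<> built from "caf" followed by U+00E9 compares equal to one built from "cafe" followed by U+0301, and converting either to unicode_string<toy_form_d> yields the decomposed code units without the caller touching a single code point.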

> [...]

Regards,
Rogier

