From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2004-10-20 14:02:40


> Such a one-size-fits-all unicode_string is guaranteed to be inefficient
> for some applications. If it is always stored in a decomposed form, an
> XML library probably wouldn't want to use it, because it requires a
> composed form. And making the encoding an implementation detail makes it
> inefficient to use in situations where binary compatibility matters
> (serialization, for example).

This is a good point. I should think, however, that a codecvt facet
should be responsible for serialization rather than the Unicode
string. Furthermore, IMO, the invariants of any Unicode string should
be checked when reading from a file anyway. This should happen on two
levels: the UTF-8 or UTF-16 encoding must be correct, and no dangling
combining characters or combining characters on control characters
should occur. The normalisation form should probably be checked as
well. Since that validation pass is needed in any case, I'm not sure
whether using Normalisation Form C rather than D will give you any big
performance gains - you may need less memory though.
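
To make the first of those two levels concrete, here is a minimal
sketch of a UTF-8 well-formedness check. It is an illustration only,
not part of the proposal: the name is_well_formed_utf8 is made up, and
the second level (combining characters, normalisation form) is not
covered because it needs the Unicode character database.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the proposal): returns true if `s`
// is well-formed UTF-8, i.e. it has no stray or missing continuation
// bytes, no overlong forms, no surrogates and nothing above U+10FFFF.
bool is_well_formed_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;    // continuation bytes expected
        unsigned long cp = 0;     // code point being decoded
        unsigned long min = 0;    // smallest value allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;        // stray continuation byte or invalid lead

        if (n - i - 1 < extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += extra + 1;
    }
    return true;
}
```

A throw_on_encoding_error style policy could throw where this sketch
returns false; a workaround policy could substitute U+FFFD instead.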

> Also, it is impossible to store an abstract unicode character in
> char32_t because there may be N zero-width combining characters
> associated with it.

I'm not sure what you mean here, but if you mean that one abstract
character would always be one codepoint: that's not true, I'm sorry to
say. Especially users of languages for which there was no encoding
before Unicode, and funny scientists like mathematicians or linguists
(I count myself among the latter), will use abstract characters that
have not been encoded as precomposed characters in Unicode. Nor will
they be; the precomposed forms are there for backwards compatibility,
mainly. Note that adding a combining mark to a precomposed character
requires decomposing it and recomposing it, so that might be pretty
slow.
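
A small illustration of the difference, assuming a compiler that
provides char32_t and std::u32string (the function names are made up
for this example). The same abstract character, e with acute, can be
one codepoint or two:

```cpp
#include <string>

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
inline std::u32string e_acute_precomposed()
{
    return {0x00E9};
}

// Decomposed form: U+0065 LATIN SMALL LETTER E followed by
// U+0301 COMBINING ACUTE ACCENT.
inline std::u32string e_acute_decomposed()
{
    return {0x0065, 0x0301};
}

// Codepoint-level comparison: the two forms differ here, although they
// are canonically equivalent and denote the same abstract character.
inline bool same_codepoints(const std::u32string& a, const std::u32string& b)
{
    return a == b;
}
```

Here same_codepoints(e_acute_precomposed(), e_acute_decomposed()) is
false, which is why a character-level string has to pick one
normalisation form and stick to it.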

> Perhaps having a one-size-fits-all unicode_string might be a nice
> default, as long as users who care about encoding and canonical form
> have other types (template + policies?) with knobs they can twiddle.

I do agree with that; and also I seem to remember from the discussion
back in April that some people felt they needed to iterate over
codepoints too.

So please allow me to propose an altered version of my earlier
proposal, taking in various suggestions from this thread.

namespace unicode {

// ***** Level 1: code units *****

// The code unit sequence is not explicitly specified, but it
// could be std::string, or SGI rope<char16_t>, or whatever.
// I think it would be reasonable to require replace, find,
// find_first_of and similar.

// ***** Level 2: codepoints *****

// The codepoint sequence is templatised on the code unit
// sequence.
// Depending on CodeUnits::value_type the encoding will
// be UTF-8, UTF-16, or UTF-32.

template <class CodeUnits>
class codepoint_string
{
    CodeUnits _code_units;
public:
    // ...

    // A user is not allowed to change the code unit
    // sequence, but it may be copied, or serialised.
    const CodeUnits & code_units() const;

    // The iterator is a bidirectional iterator.
    // This is cheap to implement on any correct Unicode-
    // encoded string since the iterator is not stateful.
    typedef ... iterator;

    // A size() member function is not included;
    // count() may be nice though.
};

// ***** Level 3: characters *****

// Normalisation policies
struct normalisation_form_c {};
struct normalisation_form_d {};

// Input policies
struct as_utf8 {};
struct as_latin1 {};
struct as_utf16 {};
// etcetera

// Error checking policies
struct throw_on_encoding_error {};
struct workaround_encoding_error {};

// An abstract Unicode character
// I have not given this class's interface much thought yet.
template <class NormalisationForm>
class character
{
    char32_t _base;
    std::vector<char32_t> _marks;
public:
    character (char32_t base);

    character & operator = (char32_t base);

    const char32_t & base() const;

    void add_mark (char32_t mark);

    // An iterator to iterate over the combining marks.
    // It is a const_iterator because we wouldn't want to
    // allow introducing non-marks in the list of marks.
    typedef std::vector<char32_t>::const_iterator mark_iterator;

    mark_iterator mark_begin() const;
    mark_iterator mark_end() const;

    // ....
};

// The actual Unicode string
template <class CodeUnits, class NormalisationForm, class ErrorChecking>
class string
{
    codepoint_string<CodeUnits> _codepoints;
public:
    // Initialise with a UTF-8 string; normalise and check for errors
    string (const CodeUnits &, as_utf8);

    template <class CodeUnits2, class NormalisationForm2, class ErrorChecking2>
    string (const string <CodeUnits2, NormalisationForm2, ErrorChecking2> &);

    // ....

    const codepoint_string<CodeUnits> & codepoints() const;
    const CodeUnits & code_units() const;

    // Another bidirectional iterator, this one iterates
    // over abstract characters.
    class iterator
    {
    public:
        // Returns a proxy object with the same interface as
        // unicode::character; modifying it changes the string.
        character_ref operator *() const;

        // ...
    };
};

} // namespace unicode

// ***** That was all *****
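
To show why the level 2 iterator needs no state, here is a rough
sketch for the UTF-8 case only. It assumes the string has already been
validated and that positions always sit at the start of a codepoint;
the free functions and their names are illustrative, not proposed
interface.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Position of the codepoint after the one starting at `pos`.
// Assumes well-formed UTF-8 and pos < s.size().
std::size_t next_codepoint(const std::string& s, std::size_t pos)
{
    ++pos;
    while (pos < s.size() && is_continuation(s[pos])) ++pos;
    return pos;
}

// Position of the codepoint before `pos`. Assumes pos > 0.
std::size_t prev_codepoint(const std::string& s, std::size_t pos)
{
    --pos;
    while (pos > 0 && is_continuation(s[pos])) --pos;
    return pos;
}

// Decodes the codepoint that starts at `pos`.
char32_t decode_at(const std::string& s, std::size_t pos)
{
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    if (lead < 0x80) return lead;
    const std::size_t extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);   // payload bits of the lead byte
    for (std::size_t k = 1; k <= extra; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + k]) & 0x3F);
    return cp;
}

// Linear-time count of codepoints: every byte that is not a
// continuation byte starts one. This is why count() rather than
// size() seems the honest name.
std::size_t count_codepoints(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (!is_continuation(s[i])) ++n;
    return n;
}
```

Both stepping functions look only at the bytes around the current
position, so a bidirectional iterator can be a bare position into the
code unit sequence. UTF-16 works the same way with surrogate pairs.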

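Mutating operations on unicode::string may require O(n) time, where n
is the length of the code unit sequence, depending on the properties
of CodeUnits: replacing one character can change the number of code
units and so shift everything after it. That's why using an SGI rope
would make sense.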

Sensible default template parameters for unicode::string still need to
be chosen.

Regards,
Rogier

