From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-04-14 07:23:39


John Maddock wrote:

>> I have already gone over this in other posts, but, in short,
>> std::basic_string makes performance guarantees that are at odds with
>> Unicode strings.
>
> Basic_string is a sequence of code points, no more no less, all
> performance guarantees for basic_string can be met as such.

Using basic_string as a container for code points is fine, but what do we do
about the other operations: find, replace, and so on?

It would be nice if the interface that users will use most of the time is the
most convenient one. If we agree that users will most likely want find,
replace and the like to work on sequences of *characters*, it's not good to
require something like

   std::find(unicode_iterator(s.begin()), unicode_iterator(s.end()),
             .....);

to do that. Since it's not possible to change the definition of std::string,
you might want a boost::unicode_string whose find methods work on
characters. Another possible approach would be:

   typedef std::basic_string<wchar_t> unicode_codepoints_string;

   class unicode_characters_string {
   public:
        unicode_characters_string(const unicode_codepoints_string&);

        // iterates over characters, not over code points
        class iterator {
        };
        iterator begin();
        iterator end();
        // no find* methods!
   private:
        // might even hold rep by reference (const, to match the constructor).
        const unicode_codepoints_string& m_rep;
   };

After that, one simply states that to do find/replace on a
'unicode_characters_string' one should use the string_algo library.
Together with a big warning that basic_string<> does not really do 100%
correct find/replace, this might be enough.
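
To make that warning concrete, here is a minimal sketch of how a find on code
points can disagree with a find on characters. It rests on several
assumptions that are not part of the proposal above: it uses C++11's
std::u32string as the code point container, it treats only U+0300..U+036F as
combining marks (a deliberate simplification; real character boundaries need
the Unicode data tables), and the names is_combining and find_character are
hypothetical.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
inline bool is_combining(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Hypothetical helper: behaves like s.find(needle), but accepts a match only
// if it starts and ends on a character boundary, i.e. it neither begins with
// a combining mark nor is immediately followed by one.
std::size_t find_character(const std::u32string& s,
                           const std::u32string& needle) {
    std::size_t pos = s.find(needle);
    while (pos != std::u32string::npos) {
        const std::size_t end = pos + needle.size();
        const bool starts_ok = !is_combining(s[pos]);
        const bool ends_ok = (end == s.size()) || !is_combining(s[end]);
        if (starts_ok && ends_ok) {
            return pos;
        }
        // The match cut a character in half; keep looking further on.
        pos = s.find(needle, pos + 1);
    }
    return std::u32string::npos;
}
```

With s = U"cafe\u0301 e", s.find(U"e") returns 3, the base letter of the
decomposed "é", while find_character(s, U"e") returns 6, the standalone "e".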

In fact, I'm still not sure basic_string is all that useful. If you have a
unicode_characters_string which does all operations correctly, and a
basic_string which does only some operations correctly, why would you use
basic_string? For efficiency?

> I'm talking about code-points (and sequences thereof), not characters or
> glyphs which as you say consist of multiple code points.
>
> I would handle "characters" and "glyphs" as iterator adapters sitting on
> top of sequences of code points. For code points, basic_string is as good
> a container as any (as are vector and deque and anything else you care to
> define).

Iterator adapters are fine for implementation. I fear that requiring users to
employ iterator adapters directly is a bad decision.

> Working on sequences of code points always requires care: clearly one
> could erase a low surrogate and leave a high surrogate "orphaned" behind,
> for example. One would need to make it clear in the documentation that
> potential problems like this can occur.

And what can the user do to avoid such problems, except for not using
basic_string?
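
For illustration, a small sketch of the failure mode in question, and of the
kind of check a user would have to run by hand after every edit. It assumes
UTF-16 stored in C++11's std::u16string, which nobody in this thread has
proposed, and has_orphan_surrogate is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check: true if the UTF-16 sequence contains a surrogate that
// is not part of a well-formed high/low pair.
bool has_orphan_surrogate(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            // High surrogate: the next unit must be a low surrogate.
            if (i + 1 == s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return true;
            }
            ++i;  // skip the low surrogate that completes the pair
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            // Low surrogate with no high surrogate in front of it.
            return true;
        }
    }
    return false;
}
```

Starting from s = u"a\U0001D11Eb", which is well formed, a plain s.erase(2, 1)
removes only the low surrogate, and has_orphan_surrogate(s) then returns
true. Nothing in basic_string itself prevents or reports this.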

- Volodya

