Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Robert Ramey (ramey_at_[hidden])
Date: 2011-01-18 14:03:24


Matus Chochlik wrote:
> On Tue, Jan 18, 2011 at 6:46 PM, Peter Dimov <pdimov_at_[hidden]>

> Boost, as the cutting edge C++ library should try to enforce new
> standards and not dwell on old and obsolete ones.

A Boost library can't just make a change that renders it obsolete for
those already using it. Boost libraries are often built into large, real
applications which can't constantly revisit every issue every release.
Users have to know that using a Boost library will save them effort, not
burden them with a new maintenance task.

> Today everybody is (maybe slowly) moving towards UTF-8

It wasn't that long ago that "everybody" was moving to wchar_t/wstring
to support Unicode. And a lot of people did. You can't know the
future and you can't impose your view of it on everyone else.

> and creating a new string class/wrapper for UTF-8 that nobody uses,

lol - well no one is going to use it until it exists.

> IMO, encourages the usage of the old ANSI encodings.

I don't see this at all.

> Maybe a better course of action would be to create ansi_str_t with
> the encoding tags for the legacy ANSI-encoded strings, which could be
> obsoleted in the future,

obsoleted by whom?

> and use std::string as the default class for UTF-8 strings.

Thereby breaking millions (billions?) of lines of currently working programs

> We will have to do this transition anyway at one point,

One can't know that.

>so why not do it now.

I confess I haven't followed this discussion in all its detail, so please
bear with me if I'm repeating something someone said or have missed
something obvious.

To my way of thinking, the way std::string is used is often equivalent to
vector<char>. It has extra sauce, but it's not so much about manipulating
text as it is about manipulating a string of bytes (named characters). So
what's wrong with something like the following:

struct utf8string : public std::string {
    // utf8char stands for a code point type; it is not defined here
    struct iterator {
        iterator & operator++(); // move to next code point
        utf8char operator*() const; // return the utf8 char at this position
        // ...
    };
    // maybe some other stuff - e.g. trap nonsensical operations
};
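
To make the "move to next code point" step concrete, here is a minimal
sketch of decoding code points out of a plain std::string. The names
next_code_point and code_points are hypothetical (they are not Boost or
standard library functions), char32_t assumes a C++11 compiler, and the
input is assumed to be well-formed UTF-8; no validation is attempted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost or the standard library.
// Decodes the code point that starts at s[i] and advances i past it.
// Assumes s holds well-formed UTF-8; malformed input is not detected.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;             // single byte (ASCII range)
    // Number of continuation bytes implied by the lead byte.
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);     // payload bits of the lead byte
    while (extra-- > 0 && i < s.size()) {
        // Each continuation byte contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Hypothetical convenience wrapper: collects every code point in s.
std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ) {
        out.push_back(next_code_point(s, i));
    }
    return out;
}
```

An iterator like the one sketched above could keep a byte offset and call
something like next_code_point from its operator++ and operator*. A real
implementation would also have to decide what to do with truncated or
invalid sequences.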

and while you're at it

struct ascii_string : public std::string {
    std::locale m_l; //
    ascii_string & operator+=(char c) {
        assert(static_cast<unsigned char>(c) < 128); // plain char may be signed
        std::string::operator+=(c);
        return *this;
    }
    // etc.
};

struct jis_string : public std::string {
    // etc.
};

and while you're at it, if you've got nothing else to do

struct ebcdic_string : public std::string {
    ebcdic_string & operator+=(char c) {
        // EBCDIC is an 8-bit code, so the ASCII range check does not apply
        std::string::operator+=(c);
        return *this;
    }
    // etc.
};
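
As a rough illustration of the "trap nonsensical operations" idea for the
ASCII case, the sketch below checks each appended byte at run time. It
differs from the sketches above in two ways that are assumptions of this
example, not part of the proposal: it holds the std::string as a member
instead of deriving from it, since std::string::operator+= is not virtual
and a std::string& bound to a derived object would bypass the check; and it
throws instead of asserting, so the check survives NDEBUG builds. The name
ascii_text is hypothetical.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical wrapper; the name and interface are illustrative only.
class ascii_text {
    std::string bytes_;
public:
    ascii_text& operator+=(char c) {
        // Compare as unsigned char: where plain char is signed,
        // (c < 128) holds for every value and would check nothing.
        if (static_cast<unsigned char>(c) >= 128) {
            throw std::invalid_argument("non-ASCII byte");
        }
        bytes_ += c;
        return *this;
    }
    // Read-only access to the underlying bytes.
    const std::string& str() const { return bytes_; }
};
```

Appending 'a' succeeds, while appending a byte such as '\xE9' throws and
leaves the stored bytes unchanged.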

Just a thought.

Robert Ramey

