Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-18 14:18:22


On Tue, Jan 18, 2011 at 8:03 PM, Robert Ramey <ramey_at_[hidden]> wrote:
> Matus Chochlik wrote:
>> On Tue, Jan 18, 2011 at 6:46 PM, Peter Dimov <pdimov_at_[hidden]>
>
>> Boost, as the cutting edge C++ library should try to enforce new
>> standards and not dwell on old and obsolete ones.
>
> A boost library can't just make a change which makes it obsolete for
> those already using it.  They are often built into large, real applications
> which can't constantly revisit every issue every release.  Users have to
> know that using a boost library will save them effort, not burden them
> with a new maintenance task.

I did not mean to say that we just declare "from now on use std::string
for utf-8" and that is all. I am aware that this would require some work
to preserve as much backward compatibility as possible, and that it might
even require interface-breaking changes.

>
>> Today everybody is (maybe slowly) moving towards UTF-8
>
> It wasn't that long ago that "everybody" was moving to wchar/wstring
> to support unicode. And a lot of people did.  You can't know the
> future and you can't impose your view of it on everyone else.

Yes, but they never abandoned the ANSI encodings. This is why
nearly every big C++ library has its own XYstring class that uses
ifdefs to switch between string/wstring.
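
For readers who have not met this pattern, a minimal sketch of such an
ifdef-switched string type follows. All names here (XY_USE_WIDE, xy_char,
xy_string, XY_TEXT) are hypothetical and are not taken from any particular
library; real libraries differ in the details.

```cpp
#include <string>

// Hypothetical build switch: define XY_USE_WIDE to select wide strings.
#ifdef XY_USE_WIDE
typedef wchar_t xy_char;
typedef std::wstring xy_string;
#define XY_TEXT(s) L##s   // turn a narrow literal into a wide one
#else
typedef char xy_char;
typedef std::string xy_string;
#define XY_TEXT(s) s      // leave the literal as it is
#endif

// Code written against xy_string compiles in both configurations,
// but every literal has to go through XY_TEXT.
inline xy_string xy_greeting()
{
    return xy_string(XY_TEXT("hello"));
}
```

The encoding of the narrow variant is still whatever the platform's ANSI
code page happens to be, which is the point made above.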

>
>
>> and creating a new string class/wrapper for UTF-8 that nobody uses,
>
> lol - well no one is going to use it until it exists.

Surely it is not necessary to explain that I did not mean it that way?
What I meant is that we can hardly expect everybody to adopt utf8_t when
Boost introduces it. As a consequence, everybody will stay with std::string
and ANSI encodings.

>
>> IMO, encourages the usage of the old ANSI encodings.
>
> I'm not seeing this at all.
>
>> Maybe a better course of action would be to create ansi_str_t with
>> the encoding tags for the legacy ANSI-encoded strings, which could be
>> obsoleted in the future,
>
> obsoleted by whom?

By its authors. The ansi_str_t would serve as a transitional type until
everybody switches to UTF-8.

>
>> and use std::string as the default class for UTF-8 strings.
>
> Thereby breaking millions (billions?) of lines of currently working programs

As a few people (who know a lot more about Unicode than I do) pointed out,
it would not be that tragic (again, I do *not* claim that this change
involves no work).

>
>> We will have to do this transition anyway at one point,
>
> One can't know that

Well, the whole string-encoding-related mess has to be resolved,
and to me it seems that UTF-8 is the candidate that will do it,
not because somebody says so but because it is already happening.
Just look at the Web and at the new releases of the major database
systems (I know this is not the whole IT sector, but it is a relevant
part of it, and many more examples could be found).

>
>>so why not do it now.
>
> I confess I haven't followed this discussion in all its detail, so please
> bear with me if I'm repeating something someone said or have missed
> something obvious.
>
> To my way of thinking, the way std::string is used is often equivalent to
> vector<char>.  It has extra sauce, but it's not so much about manipulating
> text as it is about manipulating a string of bytes (named characters).  So
> what's wrong with something like the following:
>
> struct utf8string : public std::string {
>    struct iterator {
>        const char * operator++(); // move to next code point,
>        utf8char operator*(); // return next utf8 char etc.
>        // ...
>    };
>    // maybe some other stuff - e.g. trap non-sensical operations
> };
>
> and while you're at it
>
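> struct ascii_string : public std::string {
>    std::locale m_l; //
>    ascii_string & operator+=(char c) {
>        assert(static_cast<unsigned char>(c) < 128); // reject non-ASCII bytes
>        std::string::operator+=(c);
>        return *this;
>    }
>    // etc.
> };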
>
> struct jis_string : public std::string {
>    // etc.
> };
>
> and while you're at it, if you've got nothing else to do
>
> struct ebcdic_string : public std::string {
>    ebcdic_string & operator+=(char c) {
>        assert(c < 128);
>        std::string::operator+=(c);
>        return *this;
>    }
>    // etc.
> };
>
> Just a thought.

What is wrong is that instead of the 2 string classes currently in use
you end up with N string classes. That thought
is not very appealing to me.
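
As a rough sketch of the alternative, code-point access can be layered on
a plain std::string with a free function, without adding a string class.
The name utf8_code_points is made up for this illustration and is not an
existing Boost or standard library function. The sketch assumes the string
is meant to hold UTF-8, replaces malformed or truncated sequences with
U+FFFD, and for brevity does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode a std::string assumed to contain UTF-8
// into a sequence of code points.
inline std::vector<std::uint32_t> utf8_code_points(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

Existing code that treats the string as a sequence of bytes keeps working
unchanged; only code that cares about characters needs to call the helper.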

BR

Matus

