Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-16 15:56:23


> The system I'm now using for my programs might interest you.
>
> I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning
> one type to another automatically converts it to the target type during
> the copy. (Converting to ascii_t will throw an exception if a resulting
> character won't fit into eight bits.)
>

If so (and this is what I see in the code), the name ASCII is misleading.
A type that accepts any character that fits into eight bits should be
called Latin1/ISO-8859-1, not ASCII.
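
To make the distinction concrete, here is a minimal sketch. The helper
names fits_ascii and fits_latin1 are hypothetical and only illustrate the
two ranges; they are not taken from the quoted classes.

```cpp
#include <cstdint>

// Hypothetical helpers, for illustration only.

// ASCII is a 7-bit encoding: it covers code points U+0000..U+007F.
inline bool fits_ascii(std::uint32_t code_point)
{
    return code_point <= 0x7F;
}

// Latin1 (ISO-8859-1) maps the bytes 0x00..0xFF directly to the
// code points U+0000..U+00FF, so "fits into eight bits" means Latin1.
inline bool fits_latin1(std::uint32_t code_point)
{
    return code_point <= 0xFF;
}
```

For example, U+00E9 (e with acute accent) passes fits_latin1 but not
fits_ascii, so a conversion that only rejects values above eight bits
lets it through.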

> An std::string is assumed to be ASCII-encoded. If you really do have
> UTF-8-encoded data to get into the system, you either assign it to a
> utf8_t using operator*, or use a static function utf8_t::precoded.
> std::wstring is assumed to be utf16_t- or utf32_t-encoded already,
> depending on the underlying character width for the OS.

This is a very bad assumption. To be honest, I've written lots
of code with UTF-8 strings embedded directly in it (the Boost.Locale
tests), and it has worked perfectly well with the MSVC, GCC and Intel
compilers, as long as I use char * literals and not L"" ones. It works
fine all the time.

It is a bad assumption: a std::string should be treated as a byte string,
which may or may not be UTF-8.
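
As a rough illustration of a std::string being just bytes, here is a small
sketch. The helper names utf8_code_points and hebrew_shalom are
hypothetical, and the counting function assumes its input is already valid
UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is ASSUMED
// to be valid UTF-8, by counting every byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
inline std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// The Hebrew word "shalom" (U+05E9 U+05DC U+05D5 U+05DD) written as
// UTF-8 bytes in an ordinary narrow string literal.
inline std::string hebrew_shalom()
{
    return "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";
}
```

The std::string holds eight bytes and knows nothing about the encoding;
only the code that interprets it as UTF-8 sees four code points.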

There are two cases in which we need to care about strings and their encoding:

1. We handle human language or text - collation, formatting etc.
2. We want to access Windows Wide API that is not locale agnostic.
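
For the second case, one possible shape is to keep byte strings everywhere
and convert only at the boundary where a wide API is called. The sketch
below is a hand-written UTF-8 to UTF-16 converter; the name utf8_to_utf16
is hypothetical, it is not Boost.Locale's interface, and it assumes a
C++11 compiler for std::u16string.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical boundary helper: converts a byte string that is assumed
// to be UTF-8 into UTF-16 code units, and throws std::invalid_argument
// if the bytes are not valid UTF-8.
inline std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;       // decoded code point
        std::size_t extra = 0; // number of continuation bytes expected
        char32_t min = 0;      // smallest value allowed (rejects overlong forms)

        if (lead < 0x80)                { cp = lead;        extra = 0; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

Note that such a conversion has to reject byte strings that are not valid
UTF-8, which is exactly why it belongs only at the Windows boundary and
not in front of every OS call.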

>
> For portable OS-interface functions, there's a typedef (os::native_t)
> to the type that the OS's API functions need. For Linux-based systems,
> it's utf8_t; for Windows, utf16_t. There's also a typedef
> (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but
> I'm not sure there's a need for that.
>

When you work with Linux, or Unix in general, you should not change the
encoding at all. This has been discussed before. For example, take the
following code:

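    #include <cassert>
    #include <cstdio>
    #include <fstream>
    #include <string>

    int main()
    {
        {
            // The file name starts with two 0xFF bytes, which are not valid UTF-8.
            std::ofstream t("\xFF\xFF.txt");
            if(!t) {
                // Such a name is not valid on this OS (e.g. Mac OS X)
                return 0;
            }
            t << "test";
            t.close();
        }
        {
            // Read the file back through the same byte-string name.
            std::ifstream t("\xFF\xFF.txt");
            std::string s;
            t >> s;
            assert(s == "test");
            t.close();
        }
        std::remove("\xFF\xFF.txt");
    }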

This is valid code, and it works regardless of the current locale on POSIX
platforms.

With your API it would fail, because the API makes assumptions about the
encoding.

> There are some parts of the code that could use polishing, but I like
> the overall design, and I'm finding it pretty easy to work with. Anyone
> interested in seeing the code?

IMHO, inventing new strings or new text containers is not the way
to go. std::string is perfectly fine as long as you code in
a consistent way.

Artyom

      

