Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-16 15:56:23
> The system I'm now using for my programs might interest you.
>
> I have four classes: ascii_t, utf8_t, utf16_t, and utf32_t. Assigning
> one type to another automatically converts it to the target type during
> the copy. (Converting to ascii_t will throw an exception if a resulting
> character won't fit into eight bits.)
>
If so (and this is what I see in the code), the name ASCII is misleading.
ASCII is a 7-bit encoding; a type that accepts any character fitting into
eight bits should be called Latin1/ISO-8859-1, not ASCII.
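To make the distinction concrete, here is a minimal sketch. The helper names fits_ascii and fits_latin1 are hypothetical and are not taken from the code under discussion:

```cpp
#include <cstdint>

// Hypothetical helpers, for illustration only.

// ASCII is a 7-bit encoding: code points U+0000..U+007F.
constexpr bool fits_ascii(std::uint32_t code_point)
{
    return code_point <= 0x7F;
}

// ISO-8859-1 (Latin1) maps the bytes 0x00..0xFF directly to U+0000..U+00FF.
constexpr bool fits_latin1(std::uint32_t code_point)
{
    return code_point <= 0xFF;
}

// U+00E9 (e with acute accent) fits into eight bits but is not ASCII.
static_assert(!fits_ascii(0xE9), "U+00E9 is outside ASCII");
static_assert(fits_latin1(0xE9), "U+00E9 is representable in Latin1");
```

A check of the "fits into eight bits" kind therefore accepts Latin1, not ASCII.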
> An std::string is assumed to be ASCII-encoded. If you really do have
> UTF-8-encoded data to get into the system, you either assign it to a
> utf8_t using operator*, or use a static function utf8_t::precoded.
> std::wstring is assumed to be utf16_t- or utf32_t-encoded already,
> depending on the underlying character width for the OS.
This is a very bad assumption. To be honest, I've written lots
of code with UTF-8 strings directly in it (the Boost.Locale tests),
and it works perfectly well with the MSVC, GCC and Intel
compilers (as long as I work with char * and not L""); it works
fine all the time.
The encoding should not be assumed: std::string should be treated
as a byte string which may or may not be UTF-8.
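As an illustration of a byte string that happens to hold UTF-8: std::string stores and compares the bytes without interpreting them, and only code that cares about text has to look at the encoding. A minimal sketch; the helper count_code_points is hypothetical and assumes its input is already valid UTF-8:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by skipping UTF-8
// continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; it does not validate anything.
std::size_t count_code_points(const std::string& bytes)
{
    std::size_t n = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Small self-check showing the difference between bytes and code points.
void check_examples()
{
    // "cafe" with an acute accent on the e, written as explicit UTF-8 bytes.
    const std::string s = "caf\xC3\xA9";
    assert(s.size() == 5);                // five bytes are stored
    assert(count_code_points(s) == 4);    // four code points are encoded
}
```

The string itself needs no special type; only the counting function has to know that the bytes are UTF-8.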
There are two cases where we need to care about strings and their encoding:
1. We handle human language or text - collation, formatting etc.
2. We want to access Windows Wide API that is not locale agnostic.
>
> For portable OS-interface functions, there's a typedef (os::native_t)
> to the type that the OS's API functions need. For Linux-based systems,
> it's utf8_t; for Windows, utf16_t. There's also a typedef
> (os::unicode_t) that is utf32_t on Linux and utf16_t on Windows, but
> I'm not sure there's a need for that.
>
When you work with Linux, or Unix in general, you should not change the
encoding at all. There have been discussions about this. For example, take
the following code:
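#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

int main()
{
    // The two 0xFF bytes are not valid UTF-8; POSIX treats the
    // file name as an opaque sequence of bytes.
    {
        std::ofstream t("\xFF\xFF.txt");
        if(!t) {
            // Name not valid for this OS, e.g. Mac OS X
            return 0;
        }
        t << "test";
        t.close();
    }
    {
        // Read the file back through the same byte-string name.
        std::ifstream t("\xFF\xFF.txt");
        std::string s;
        t >> s;
        assert(s == "test");
        t.close();
    }
    std::remove("\xFF\xFF.txt");
}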
This is valid code, and it works regardless of the current locale on POSIX
platforms.
With your API it would fail, because the API makes assumptions about the
encoding.
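To show why a layer that assumes UTF-8 cannot pass such a name through, here is a minimal sketch of a UTF-8 well-formedness check. The function is_valid_utf8 is hypothetical; it is not part of Boost.Locale or of the API being discussed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns true if the bytes form well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;      // continuation bytes expected
        std::uint32_t cp;       // code point being decoded
        std::uint32_t min_cp;   // smallest value allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;      // 0x80..0xBF and 0xF8..0xFF cannot start a sequence

        if (n - i - 1 < extra) return false;    // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += extra + 1;
    }
    return true;
}

// Small self-check using the file name from the example.
void check_file_names()
{
    assert(is_valid_utf8("test.txt"));
    assert(!is_valid_utf8("\xFF\xFF.txt"));   // the byte 0xFF never occurs in UTF-8
}
```

Any conversion that validates or transcodes the name has to reject or alter "\xFF\xFF.txt", while passing the bytes through untouched keeps the file reachable.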
> There are some parts of the code that could use polishing, but I like
> the overall design, and I'm finding it pretty easy to work with. Anyone
> interested in seeing the code?
IMHO, inventing new string types or new text containers is not
the way to go. std::string is perfectly fine as long
as you code in a consistent way.
Artyom
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk