Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-17 13:09:13

> I've done some research, and it looks like it would require little
> effort to create an os::string_t type that uses the current locale, and
> assume all raw std::strings that contain eight-bit values are coded in
> that instead.
> Design-wise, ascii_t would need to change slightly after this, to throw
> on anything that can't fit into a *seven*-bit value, rather than
> eight-bit. I'll add the default-character option to both types as well,
> and maybe make other improvements as I have time.

Unfortunately, this is not the correct approach either.

For example, why do you think it is safe to pass the ASCII subset of UTF-8
to the current non-UTF-8 locale?

Shift-JIS, which is in use on Windows through the ANSI API, has a different
character set in the 0-127 range - it is not ASCII!
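
To make the Shift-JIS point concrete, here is a minimal sketch. It assumes
the JIS X 0201 Roman set that standard Shift_JIS uses for its single-byte
range; vendor conversion tables (Windows code page 932, for instance) may
treat these bytes differently. The function name is hypothetical.

```cpp
// Map one byte of the 0x00-0x7F range, read as JIS X 0201 Roman,
// to a Unicode code point. Precondition: byte < 0x80.
// Assumption: standard Shift_JIS single-byte half; vendor tables may differ.
inline char32_t jis_roman_to_unicode(unsigned char byte) {
    switch (byte) {
        case 0x5C: return U'\u00A5'; // YEN SIGN, where ASCII has a backslash
        case 0x7E: return U'\u203E'; // OVERLINE, where ASCII has a tilde
        default:   return byte;      // the remaining bytes match ASCII
    }
}
```

So a path separator written as byte 0x5C is a valid "ASCII" byte in UTF-8,
yet under this mapping it does not mean a backslash.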

Also, if you want to use the std::codecvt facets...
Don't rely on them unless you know where they come from!

1. By default they are a no-op - in the default C locale.

2. Under most compilers they are not implemented properly.

   OS \ Compiler   MSVC   GCC    SunOS/stlport   SunOS/standard
   Windows         ok     none   -               -
   Linux           -      ok     ?               ?
   Mac OS X        -      none   -               -
   FreeBSD         -      none   -               -
   Solaris         -      none   buggy!          ok-but-non-standard

Bottom line: don't rely on the "current locale" :-)
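
A small sketch of point 1, under the assumption that the program has not
installed any other locale: the global C++ locale then is the classic "C"
locale, and its char-to-char codecvt facet is specified to do no conversion.
The helper names are hypothetical, and the behaviour of the wchar_t facet is
implementation-specific and deliberately not checked here.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Name of the global C++ locale currently in effect.
inline std::string global_locale_name() {
    return std::locale().name();
}

// True if the char-to-char codecvt facet of `loc` never converts anything.
inline bool narrow_codecvt_is_noop(const std::locale& loc) {
    using facet = std::codecvt<char, char, std::mbstate_t>;
    return std::use_facet<facet>(loc).always_noconv();
}
```

Calling global_locale_name() at program start is a cheap way to see which
locale a conversion would actually go through.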

> Artyom, since you seem to have more experience with this stuff than I,
> what do you think? Would those alterations take care of your objections?

The rule of thumb is the following:

- When you handle strings as text storage, just use std::string.

- When you make a system call:

  a) on POSIX - pass it as is
  b) on Windows - convert from UTF-8 and call the Wide API

- When handling text as text (i.e. formatting, collation etc.),
  use a good library.
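
Step b) can be sketched as follows. This is only an illustration: the Wide
API takes UTF-16, and real Windows code would more likely call
MultiByteToWideChar with CP_UTF8. The hand-written decoder below is used so
that the sketch runs anywhere; the name utf8_to_utf16 is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-16 code units; throws on malformed input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        char32_t min = 0;      // smallest code point valid for this length
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000; // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

The rest of the program keeps plain std::string in UTF-8; the conversion
happens only at the boundary, right before the system call.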

I would strongly recommend reading the answer by Pavel Radzivilovsky
on Stack Overflow:

He is a hard-core Windows programmer, designer, architect and developer,
and still he chose UTF-8!

The problem is that the issue is so complicated that making
it absolutely general and at the same time right is not realistic.
There is only one way out - decide what you are working with and
stick with it.

In the CppCMS project I work on (and I developed Boost.Locale
because of it), I stick with UTF-8 by default and use plain
std::string - works like a charm.

Inventing "special Unicode strings or storage" does not
improve anybody's understanding of Unicode, nor does it improve
its handling.


