Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-17 13:09:13


> I've done some research, and it looks like it would require little
> effort to create an os::string_t type that uses the current locale, and
> assume all raw std::strings that contain eight-bit values are coded in
> that instead.
>
> Design-wise, ascii_t would need to change slightly after this, to throw
> on anything that can't fit into a *seven*-bit value, rather than
> eight-bit. I'll add the default-character option to both types as well,
> and maybe make other improvements as I have time.
>

Unfortunately, this is not the correct approach either.

For example, why do you think it is safe to pass the ASCII subset of UTF-8
to the current non-UTF-8 locale?

Shift-JIS, which is used by the Windows ANSI API, has a different
character set in the 0-127 range - it is not ASCII!

Also, if you want to use std::codecvt facets...
Don't rely on them unless you know where they come from!

1. By default they are no-ops - in the default C locale.

2. Under most compilers they are not implemented properly.

   OS \ Compiler   MSVC   GCC    SunOS/stlport   SunOS/standard
   -------------------------------------------------------------------
   Windows         ok     none   -               -
   Linux           -      ok     ?               ?
   Mac OS X        -      none   -               -
   FreeBSD         -      none   -               -
   Solaris         -      none   buggy!          ok-but-non-standard

Bottom line: don't rely on the "current locale" :-)
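
Point 1 above is easy to check for yourself. The following is only a
minimal sketch: the two function names are made up for illustration, and
it assumes nothing in the program has called std::locale::global() yet.
It says nothing about point 2, which depends on your standard library.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>

// Hypothetical helper: true if the global locale is still the classic
// "C" locale, which is the case until std::locale::global() is called.
bool global_locale_is_classic()
{
    return std::locale() == std::locale::classic();
}

// Hypothetical helper: true if the char <-> char codecvt facet of the
// global locale reports that it never converts anything (a no-op).
bool narrow_codecvt_is_noop()
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale());
    return cvt.always_noconv();
}
```

If both return true, nothing in the process has set up a locale, so any
conversion that goes through the "current locale" is not doing the
encoding work you might expect from it.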

>
> Artyom, since you seem to have more experience with this stuff than I,
> what do you think? Would those alterations take care of your objections?
>

The rule of thumb is as follows:

- When you handle strings as text storage, just use std::string

- When you do a system call

  a) on Posix - pass it as is
  b) on Windows - Convert to Wide API from UTF-8
  
- When handling text as text (i.e. formatting, collation etc.)
  use a good library.
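
To illustrate rule (b), here is a rough sketch of the conversion step.
The name utf8_to_utf16 is hypothetical (it is not taken from Boost or
CppCMS), and it assumes a C++11 compiler for std::u16string. On Windows
you would normally get the same result from MultiByteToWideChar with
CP_UTF8 and hand the wide string to the W variant of the API; the
hand-written decoder below only shows what that step does.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 std::string into UTF-16.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;        // decoded code point
        std::size_t extra = 0;  // continuation bytes expected
        char32_t min = 0;       // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

The point of doing it this way is that the conversion lives only at the
system-call boundary; everything else in the program keeps passing plain
std::string around.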

I would strongly recommend reading Pavel Radzivilovsky's answer
on Stack Overflow:

  
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375

And he is a hard-core Windows programmer, designer, architect and developer,
and still he chose UTF-8!

The problem is that the issue is so complicated that there is only
one way to make it both completely general and correct:
decide what you are working with and stick with it.

In the CppCMS project I work on (and I developed Boost.Locale
because of it), I stick with UTF-8 by default and use plain
std::string - it works like a charm.

Inventing "special Unicode strings or storage" does not
improve anybody's understanding of Unicode, nor does it improve
its handling.

Best,
  Artyom
