Subject: [boost] "Best so far" for C level i/o of Unicode text with Windows console
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-11-03 19:04:52


This may be of interest to developers of Boost libraries that deal with
text.

I found that the Visual C++ implementation of the C library i/o
generally does not support console input of international characters. It
can deal with narrow character input from the current codepage, if that
codepage is not UTF-8. For example:

<code>
#include <stdio.h>

int main()
{
     printf( "? " );
     char buffer[80];
     scanf( "%79s", buffer );   /* Width limit guards the 80-byte buffer. */
     printf( "%s\n", buffer );
}
</code>

<result>
d:\dev\test> chcp 1252
Active code page: 1252

d:\dev\test> utf8test
? abcæøå
abcæøå

d:\dev\test> chcp 65001
Active code page: 65001

d:\dev\test> utf8test
? abcæøå
��,

d:\dev\test> _
</result>

I placed a comment about this at the blog of Microsoft's Unicode guru
Michael Kaplan,

http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

In particular, with active codepage 65001 (UTF-8), functions such as
wscanf just fail outright on non-ASCII characters.

The best compromise that I have managed to come up with, IMO, is a kind
of hybrid: the active input codepage is set to 1252, Windows ANSI
Western, because that's a superset of (the printable part of) Latin 1,
which is a subset of Unicode. Output is converted to UTF-8. Input from
e.g. a file or pipe is accepted as UTF-8.
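
To illustrate why Latin 1 is convenient here, below is a minimal sketch
(not code from my scheme; the name latin1ToUtf8 is made up for this
example) of converting Latin 1 bytes to UTF-8. Since each Latin 1 byte
value equals its Unicode code point, no lookup table is needed. Note the
assumption: this handles strict ISO 8859-1 only. The cp 1252 bytes 0x80
through 0x9F (e.g. the euro sign) would need an extra table.

```cpp
#include <string>

// Hypothetical helper, for illustration only: converts ISO 8859-1
// (Latin 1) bytes to UTF-8. Each Latin 1 byte value equals its Unicode
// code point. cp 1252 bytes 0x80..0x9F are NOT handled correctly.
std::string latin1ToUtf8( std::string const& latin1 )
{
    std::string utf8;
    utf8.reserve( latin1.size() );
    for( char const ch : latin1 )
    {
        unsigned char const byte = static_cast<unsigned char>( ch );
        if( byte < 0x80 )
        {
            utf8 += ch;     // ASCII is encoded as itself.
        }
        else
        {
            // Code points U+0080 through U+00FF take two UTF-8 bytes.
            utf8 += static_cast<char>( 0xC0 | (byte >> 6) );
            utf8 += static_cast<char>( 0x80 | (byte & 0x3F) );
        }
    }
    return utf8;
}
```

E.g. the Latin 1 bytes E6 F8 E5 ("æøå") come out as C3 A6 C3 B8 C3 A5.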

This means that e.g. Norwegian text can be input interactively
(restricted to Latin 1 character set) or from a pipe or file (as UTF-8)
without problems, and can be automatically translated (by the C i/o
level) to UTF-16 encoding inside the program. However, I'm guessing that
the C level input will garble e.g. Russian text no matter what one does,
if one desires some Unicode based encoding inside the program. The
Windows and Visual C++ support is just full of bugs -- e.g. as the
crash of `more` showed in an earlier thread.
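
For reference, the UTF-8 to UTF-16 translation mentioned above amounts
to roughly the following. This is a sketch under assumptions, not what
the Visual C++ runtime actually does internally: the name utf8ToUtf16 is
made up, and the input is assumed to be well-formed UTF-8. A Cyrillic
letter such as U+042F decodes fine this way from a file or pipe, but it
has no byte at all in cp 1252, which is why interactive input of it
can't work with the hybrid scheme.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only. Assumes well-formed UTF-8;
// a production decoder must also reject overlong, truncated and
// otherwise invalid sequences.
std::u16string utf8ToUtf16( std::string const& utf8 )
{
    std::u16string result;
    std::size_t i = 0;
    while( i < utf8.size() )
    {
        unsigned char const lead = static_cast<unsigned char>( utf8[i] );

        // Number of continuation bytes, as indicated by the lead byte.
        std::size_t const nTrail =
            (lead < 0x80? 0 : lead < 0xE0? 1 : lead < 0xF0? 2 : 3);

        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        char32_t cp = (nTrail == 0? lead : lead & (0x3F >> nTrail));

        // Each continuation byte contributes its low 6 bits.
        for( std::size_t k = 1; k <= nTrail && i + k < utf8.size(); ++k )
        {
            cp = (cp << 6) | (static_cast<unsigned char>( utf8[i + k] ) & 0x3F);
        }
        i += 1 + nTrail;

        if( cp < 0x10000 )
        {
            result += static_cast<char16_t>( cp );
        }
        else
        {
            // Outside the BMP: encode as a surrogate pair.
            cp -= 0x10000;
            result += static_cast<char16_t>( 0xD800 + (cp >> 10) );
            result += static_cast<char16_t>( 0xDC00 + (cp & 0x3FF) );
        }
    }
    return result;
}
```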

Initialization of streams for the compromise scheme:

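// Needs <stdio.h> (_fileno), <io.h> (_isatty, _setmode) and
// <fcntl.h> (_O_U8TEXT). hopefully() and throwX() are helper
// functions not shown in this post.
static void msvcCompatibleInit()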
{
     struct Fix
     {
         static void mode( FILE* f, char const errorText[] )
         {
             int const fileNo = _fileno( f );
             bool const isConsoleInput =
                 (f == stdin && _isatty( fileNo ));

             if( isConsoleInput )
             {
                 // Bytes are received as per the active codepage in the
                 // console. Except if that active codepage is 65001, in
                 // which case non-ASCII characters fail. Also, _setmode
                 // just causes non-ASCII fail.
                 //
                 // Setting the console codepage to 1252 might be
                 // practically helpful, since cp 1252, Windows ANSI
                 // Western, is a superset of Latin 1 which is a subset
                 // of Unicode. However, non-Latin 1 characters will
                 // then be incorrect, just as with web pages in the old
                 // days.
             }
             else
             {
                 int const newMode = _setmode( fileNo, _O_U8TEXT );
                 hopefully( newMode != -1 )
                     || throwX( errorText );
             }
         }
     };

     Fix::mode( stdin, "_setmode stdin failed" );
     Fix::mode( stdout, "_setmode stdout failed" );
     Fix::mode( stderr, "_setmode stderr failed" );
}
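
The initialization above leaves the console input codepage alone. One
way to set it to 1252 from within the program, instead of running
`chcp 1252` first, might be the Win32 function SetConsoleCP. The sketch
below is an assumption about how one could do it, not part of the code
above, and the function name is made up. On non-Windows systems it does
nothing and reports failure.

```cpp
#ifdef _WIN32
#   include <windows.h>     // SetConsoleCP
#endif

// Hypothetical helper, for illustration only. Returns true only if the
// console input codepage was switched to 1252 (Windows ANSI Western).
bool setConsoleInputToCp1252()
{
#ifdef _WIN32
    // SetConsoleCP returns nonzero on success. It can fail, e.g. when
    // the process has no console attached.
    return SetConsoleCP( 1252 ) != 0;
#else
    return false;           // No Windows console to configure.
#endif
}
```

A caller would typically invoke this before msvcCompatibleInit() and
just carry on with whatever codepage is active if it returns false.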

Cheers & hth.,

- Alf

