Subject: [boost] "Best so far" for C level i/o of Unicode text with Windows console
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-11-03 19:04:52

This may be of interest to developers of Boost libraries that deal with

I found that the Visual C++ implementation of the C library i/o
generally does not support console input of international characters. It
can deal with narrow character input from the current codepage, if that
codepage is not UTF-8. For example:

#include <stdio.h>

int main()
{
     printf( "? " );
     char buffer[80];
     scanf( "%79s", buffer );       // Width limit avoids overrunning buffer.
     printf( "%s\n", buffer );
}

d:\dev\test> chcp 1252
Active code page: 1252

d:\dev\test> utf8test
? abcæøå

d:\dev\test> chcp 65001
Active code page: 65001

d:\dev\test> utf8test
? abcæøå

d:\dev\test> _

I placed a comment about this at the blog of Microsoft's Unicode guru
Michael Kaplan,

In particular, with active codepage 65001 (UTF-8), functions such as
wscanf just fail outright on non-ASCII characters.

The IMO best compromise that I have managed to come up with is a kind of
hybrid, where the active input codepage is set to 1252, Windows ANSI
Western, because that's a superset of Latin 1 which is a subset of
Unicode. Output is converted to UTF-8. Input from e.g. a file accepts UTF-8.
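
The "Latin 1 is a subset of Unicode" point can be illustrated with a short sketch: each Latin 1 byte value is numerically equal to its Unicode code point, so converting to UTF-8 needs no lookup table. The function name `latin1ToUtf8` is hypothetical and not from the original post. Note also that codepage 1252 differs from Latin 1 in the byte range 0x80 through 0x9F (for example, the euro sign lives there), so this sketch would mistranslate those bytes.

```cpp
#include <string>

// Hypothetical helper, not part of the original post.
// Assumes the input bytes are Latin 1 (ISO 8859-1); cp 1252 bytes in the
// range 0x80..0x9F are NOT handled correctly by this sketch.
std::string latin1ToUtf8( std::string const& latin1 )
{
    std::string result;
    result.reserve( latin1.size() * 2 );
    for( char const ch : latin1 )
    {
        unsigned char const byte = static_cast<unsigned char>( ch );
        if( byte < 0x80 )
        {
            // ASCII: one UTF-8 byte, unchanged.
            result += static_cast<char>( byte );
        }
        else
        {
            // Code points U+0080..U+00FF take two UTF-8 bytes: 110xxxxx 10xxxxxx.
            result += static_cast<char>( 0xC0 | (byte >> 6) );
            result += static_cast<char>( 0x80 | (byte & 0x3F) );
        }
    }
    return result;
}
```

For instance, the Latin 1 byte 0xE6 (the letter æ, U+00E6) becomes the two UTF-8 bytes 0xC3 0xA6.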

This means that e.g. Norwegian text can be input interactively
(restricted to Latin 1 character set) or from a pipe or file (as UTF-8)
without problems, and can be automatically translated (by the C i/o
level) to UTF-16 encoding inside the program. However, I'm guessing that
the C level input will garble e.g. Russian text no matter what one does,
if one desires some Unicode based encoding inside the program. The
Windows and Visual C++ support is just full of bugs -- e.g. as the
crash of `more` showed in an earlier thread.
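
As a rough illustration of what a UTF-8 to UTF-16 translation involves, here is a minimal sketch. It is not the Visual C++ runtime's implementation; the name `utf8ToUtf16` is hypothetical, and the sketch assumes well-formed input apart from a few basic checks (it does not reject overlong encodings, encoded surrogates, or code points above U+10FFFF).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of the original post.
std::u16string utf8ToUtf16( std::string const& utf8 )
{
    std::u16string result;
    std::size_t i = 0;
    while( i < utf8.size() )
    {
        unsigned char const lead = static_cast<unsigned char>( utf8[i] );
        std::size_t nTrailing = 0;
        char32_t codePoint = 0;

        // The lead byte tells how many continuation bytes follow.
        if( lead < 0x80 )                { nTrailing = 0; codePoint = lead; }
        else if( (lead & 0xE0) == 0xC0 ) { nTrailing = 1; codePoint = lead & 0x1F; }
        else if( (lead & 0xF0) == 0xE0 ) { nTrailing = 2; codePoint = lead & 0x0F; }
        else if( (lead & 0xF8) == 0xF0 ) { nTrailing = 3; codePoint = lead & 0x07; }
        else { throw std::runtime_error( "invalid UTF-8 lead byte" ); }

        if( utf8.size() - i <= nTrailing )
        {
            throw std::runtime_error( "truncated UTF-8 sequence" );
        }
        for( std::size_t k = 1; k <= nTrailing; ++k )
        {
            unsigned char const byte = static_cast<unsigned char>( utf8[i + k] );
            if( (byte & 0xC0) != 0x80 )
            {
                throw std::runtime_error( "invalid UTF-8 continuation byte" );
            }
            // Each continuation byte contributes its low six bits.
            codePoint = static_cast<char32_t>( (codePoint << 6) | (byte & 0x3F) );
        }
        i += nTrailing + 1;

        if( codePoint < 0x10000 )
        {
            result += static_cast<char16_t>( codePoint );
        }
        else
        {
            // Outside the Basic Multilingual Plane: encode as a surrogate pair.
            char32_t const v = codePoint - 0x10000;
            result += static_cast<char16_t>( 0xD800 + (v >> 10) );
            result += static_cast<char16_t>( 0xDC00 + (v & 0x3FF) );
        }
    }
    return result;
}
```

With this sketch, the UTF-8 bytes 0xD0 0xAF (Cyrillic Я, U+042F) decode to the single UTF-16 unit 0x042F, so text read from a pipe or file as UTF-8 is not limited to Latin 1.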

Initialization of streams for the compromise scheme:

#include <stdio.h>      // FILE, stdin, stdout, stderr, _fileno
#include <io.h>         // _isatty, _setmode
#include <fcntl.h>      // _O_U8TEXT

// hopefully() and throwX() are helper functions that are not shown here.
static void msvcCompatibleInit()
{
     struct Fix
     {
         static void mode( FILE* f, char const errorText[] )
         {
             int const fileNo = _fileno( f );
             bool const isConsoleInput = (f == stdin && _isatty( fileNo ));

             if( isConsoleInput )
             {
                 // Bytes are received as per the active codepage in the console.
                 // Except if that active codepage is 65001, in which case non-ASCII
                 // characters fail. Also, _setmode just causes non-ASCII fail.
                 // Setting the console codepage to 1252 might be practically helpful,
                 // since cp 1252, Windows ANSI Western, is a superset of Latin 1
                 // which is a subset of Unicode. However, non-Latin 1 characters will
                 // then be incorrect, just as with web pages in the old days.
             }
             else
             {
                 int const newMode = _setmode( fileNo, _O_U8TEXT );
                 hopefully( newMode != -1 )
                     || throwX( errorText );
             }
         }
     };

     Fix::mode( stdin, "_setmode stdin failed" );
     Fix::mode( stdout, "_setmode stdout failed" );
     Fix::mode( stderr, "_setmode stderr failed" );
}
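
The code above does not show how the console input codepage would be set to 1252. One possible way, sketched here as an assumption and not as the author's actual code, is the Windows API function `SetConsoleCP`. The wrapper name `trySetConsoleInputCodepage` is hypothetical, and the `_WIN32` guard is only there so the sketch also compiles on other platforms, where it does nothing.

```cpp
#ifdef _WIN32
#   include <windows.h>     // SetConsoleCP
#endif

// Hypothetical helper, not part of the original post.
// Tries to set the console input codepage. Returns false on failure,
// and always returns false when not built for Windows.
bool trySetConsoleInputCodepage( unsigned const codepage )
{
#ifdef _WIN32
    return SetConsoleCP( codepage ) != 0;
#else
    (void) codepage;        // No console codepage to set on this platform.
    return false;
#endif
}
```

A call such as `trySetConsoleInputCodepage( 1252 )` could then be made in the `isConsoleInput` branch. Whether a failure there should be fatal or silently ignored is a design choice the original post does not settle.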

Cheers & hth.,

- Alf
