Subject: [boost] "Best so far" for C level i/o of Unicode text with Windows console
From: Alf P. Steinbach (alf.p.steinbach+usenet_at_[hidden])
Date: 2011-11-03 19:04:52
This may be of interest to developers of Boost libraries that deal with
text.
I found that the Visual C++ implementation of the C library i/o
generally does not support console input of international characters. It
can deal with narrow character input from the current codepage, if that
codepage is not UTF-8. For example:
<code>
#include <stdio.h>

int main()
{
    printf( "? " );
    char buffer[80];
    scanf( "%79s", buffer );    // Width limit keeps input within the buffer.
    printf( "%s\n", buffer );
}
</code>
<result>
d:\dev\test> chcp 1252
Active code page: 1252
d:\dev\test> utf8test
? abcæøå
abcæøå
d:\dev\test> chcp 65001
Active code page: 65001
d:\dev\test> utf8test
? abcæøå
��,
d:\dev\test> _
</result>
I placed a comment about this on the blog of Microsoft's Unicode guru
Michael Kaplan:
http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx
In particular, with active codepage 65001 (UTF-8), functions such as
wscanf just fail outright on non-ASCII characters.
The best compromise I have managed to come up with, IMO, is a kind of
hybrid. The active input codepage is set to 1252, Windows ANSI Western,
because that's a superset of Latin 1, which in turn is a subset of
Unicode. Output is converted to UTF-8. Input from e.g. a pipe or file
is accepted as UTF-8.
This means that e.g. Norwegian text can be input interactively
(restricted to Latin 1 character set) or from a pipe or file (as UTF-8)
without problems, and can be automatically translated (by the C i/o
level) to UTF-16 encoding inside the program. However, I'm guessing that
the C level input will garble e.g. Russian text no matter what one does,
if one wants some Unicode-based encoding inside the program. The
Windows and Visual C++ support is just full of bugs, as e.g. the
crash of `more` showed in an earlier thread.
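As a minimal sketch of why Latin 1 input is the easy case: each Latin 1
byte value is also its Unicode code point, so widening to UTF-16 needs no
table, and re-encoding as UTF-8 takes at most two bytes per character.
The names latin1ToUtf16 and latin1ToUtf8 are hypothetical, not part of
the scheme above or of any library, and the code assumes a C++11 compiler.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widen Latin 1 bytes to UTF-16 code units.
// A Latin 1 byte value equals its Unicode code point, and every such
// code point fits in a single UTF-16 code unit.
std::u16string latin1ToUtf16( std::string const& bytes )
{
    std::u16string result;
    result.reserve( bytes.size() );
    for( char const ch : bytes )
    {
        // Go via unsigned char so that bytes >= 0x80 do not sign-extend.
        result.push_back( static_cast<char16_t>( static_cast<unsigned char>( ch ) ) );
    }
    assert( result.size() == bytes.size() );    // One code unit per byte.
    return result;
}

// Hypothetical helper: re-encode Latin 1 bytes as UTF-8.
std::string latin1ToUtf8( std::string const& bytes )
{
    std::string result;
    for( char const ch : bytes )
    {
        unsigned const code = static_cast<unsigned char>( ch );
        if( code < 0x80 )
        {
            // ASCII is encoded as itself.
            result.push_back( static_cast<char>( code ) );
        }
        else
        {
            // U+0080..U+00FF: lead byte 110xxxxx, continuation byte 10xxxxxx.
            result.push_back( static_cast<char>( 0xC0 | (code >> 6) ) );
            result.push_back( static_cast<char>( 0x80 | (code & 0x3F) ) );
        }
    }
    return result;
}
```

For example, the bytes E6 F8 E5 ("æøå" in Latin 1 and in cp 1252) come
out as the UTF-8 bytes C3 A6 C3 B8 C3 A5.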
Initialization of streams for the compromise scheme:
<code>
#include <stdio.h>      // FILE, stdin, stdout, stderr, _fileno
#include <io.h>         // _isatty, _setmode
#include <fcntl.h>      // _O_U8TEXT

// Note: hopefully() and throwX() are helper functions not shown in this posting.
static void msvcCompatibleInit()
{
    struct Fix
    {
        static void mode( FILE* f, char const errorText[] )
        {
            int const fileNo = _fileno( f );
            bool const isConsoleInput = (f == stdin && _isatty( fileNo ));
            if( isConsoleInput )
            {
                // Bytes are received as per the active codepage in the console.
                // Except if that active codepage is 65001, in which case non-ASCII
                // characters fail. Also, _setmode just makes non-ASCII input fail.
                //
                // Setting the console codepage to 1252 might be practically helpful,
                // since cp 1252, Windows ANSI Western, is a superset of Latin 1
                // which is a subset of Unicode. However, non-Latin 1 characters will
                // then be incorrect, just as with web pages in the old days.
            }
            else
            {
                int const newMode = _setmode( fileNo, _O_U8TEXT );
                hopefully( newMode != -1 )
                    || throwX( errorText );
            }
        }
    };

    Fix::mode( stdin, "_setmode stdin failed" );
    Fix::mode( stdout, "_setmode stdout failed" );
    Fix::mode( stderr, "_setmode stderr failed" );
}
</code>
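One reading of the caveat about non-Latin 1 characters in the comment
above concerns the 32 byte values 0x80 through 0x9F. There cp 1252 has
printable characters such as the euro sign, while the first 256 Unicode
code points have control codes, so treating cp 1252 bytes as plain
Latin 1 gets those characters wrong. A sketch, with the hypothetical
name cp1252ToCodePoint; passing the five positions that cp 1252 leaves
undefined through unchanged is an assumption of this sketch:

```cpp
#include <cassert>

// Hypothetical helper: map one cp 1252 byte to a Unicode code point.
char32_t cp1252ToCodePoint( unsigned char byte )
{
    // Code points for bytes 0x80..0x9F, the only range where cp 1252
    // differs from Latin 1. A 0 marks a position cp 1252 leaves undefined.
    static char32_t const table[32] =
    {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
    };

    if( byte < 0x80 || byte > 0x9F )
    {
        return byte;    // Same as Latin 1: byte value equals code point.
    }
    int const index = byte - 0x80;
    assert( 0 <= index && index < 32 );
    char32_t const mapped = table[index];
    // Assumption of this sketch: undefined bytes pass through unchanged.
    return (mapped != 0 ? mapped : static_cast<char32_t>( byte ));
}
```

With this, byte 0xE6 ("æ") maps to U+00E6 either way, but byte 0x80 maps
to U+20AC (the euro sign) rather than to the control code U+0080.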
Cheers & hth.,
- Alf
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk