From: Tilman Kuepper (kuepper_at_[hidden])
Date: 2004-08-02 09:33:16
Hello world,
I took a closer look at the UTF-8 codecvt facet which is part
of the program_options library. A test program is attached.
The last assert (in the Read() function) fails with g++ (GCC)
3.3.3 (Debian 20040429).
After some debugging I think I found the problem: The function
utf8_codecvt_facet_wchar_t::do_in() converts only valid
(complete) UTF-8 sequences into internal (wchar_t) characters.
In case the input buffer ends with an incomplete UTF-8 character,
do_in() returns codecvt_base::partial and points from_next at
the beginning of this incomplete UTF-8 sequence.
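
To make the from_next position concrete, here is a minimal sketch
of where a buffer splits into complete sequences and an incomplete
tail. Both helper names are hypothetical and are not part of the
Boost facet; the sketch only looks at lead bytes and does not
validate continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not Boost code): number of bytes that a
// UTF-8 lead byte announces for its sequence.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;  // invalid lead byte; treated as one unit in this sketch
}

// Hypothetical helper: how many leading bytes of buf consist of
// complete sequences. The returned offset corresponds to the place
// where from_next would point when the conversion reports partial.
inline std::size_t complete_utf8_prefix(const std::string& buf)
{
    std::size_t pos = 0;
    while (pos < buf.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(buf[pos]));
        if (buf.size() - pos < len) break;  // incomplete tail starts here
        pos += len;
    }
    assert(pos <= buf.size());
    return pos;
}
```

For a buffer holding "a" followed by only the first byte of a
two-byte character, the helper returns 1: one byte is convertible
and the remaining byte has to wait for more input.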
Apparently the library (libstdc++) does not expect the codecvt
facet to stop the translation while there is still room in the
output buffer (i.e. to_next != to_end) and not all input
characters have been processed (from_next != from_end).
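
For illustration, one way a caller can cope with a partial result
is to keep the unconsumed tail and append the next piece of input
before converting again. The sketch below only counts code points
instead of converting them, and both function names are
hypothetical. It is not a description of what libstdc++ or the
Boost facet actually do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: sequence length announced by a UTF-8 lead byte.
inline std::size_t seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;  // invalid lead byte; treated as one unit in this sketch
}

// Hypothetical helper: counts code points while feeding bytes in
// pieces of chunk bytes, the way a file buffer is refilled.
inline std::size_t count_code_points_chunked(const std::string& bytes,
                                             std::size_t chunk)
{
    if (chunk == 0) return 0;  // nothing can be fed
    std::string pending;       // plays the role of [from_next, from_end)
    std::size_t count = 0;
    for (std::size_t off = 0; off < bytes.size(); off += chunk) {
        pending.append(bytes, off, chunk);  // refill behind the old tail
        std::size_t pos = 0;
        while (pos < pending.size()) {
            const std::size_t len =
                seq_len(static_cast<unsigned char>(pending[pos]));
            if (pending.size() - pos < len) break;  // "partial"
            pos += len;
            ++count;
        }
        pending.erase(0, pos);       // keep only the incomplete tail
        assert(pending.size() < 4);  // a tail is at most 3 bytes long
    }
    return count;
}
```

With this carry-over the count does not depend on the chunk size,
which is the behaviour the Read() loop in the test program relies
on.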
As a consequence the for-loop in the test program stops too
early (wifstream not "good" any longer) and assert(pos ==
wstr.size()) fails.
Is this a known issue with the GNU library or with the UTF-8
conversion facet? And what can be done?
Best regards from Aachen,
Tilman
PS: You can find the codecvt facet cpp/hpp files in the
folders /boost/boost/program_options/detail/ and in
/boost/libs/program_options/src/
PPS: There seems to be no problem with VC7.1/Dinkumware.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#include "utf8_codecvt_facet.hpp"
#include <string>
#include <iostream>
#include <fstream>
#include <locale>
#include <cassert>
#include <cwchar>   // WCHAR_MAX
using namespace std;
using namespace boost;
using namespace boost::program_options;
using namespace boost::program_options::detail;
namespace { wstring wstr; }
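// Returns true if wch is a code point the test should write.
inline bool IsUnicode(wchar_t wch)
{
    if(wch >= 0x00D800 && wch < 0x00E000) return false; // surrogates
    if(wch >= 0x00FFFE && wch < 0x010000) return false; // U+FFFE, U+FFFF
    if(wch >= 0x110000) return false;                   // beyond U+10FFFF
    return true;
}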
inline void MakeTestStr()
{
    const size_t loops = 1;
    wstr.clear();
    wstr.reserve(loops * 0x110000);
    for(size_t i = 0; i < loops; ++i)
        for(wchar_t wch = 0; wch < WCHAR_MAX; ++wch)
            if(IsUnicode(wch)) wstr += wch;
}
inline void Write()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wofstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    f << wstr;
    assert(f);
}
inline void Read()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wifstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    wchar_t wch;
    size_t pos = 0;
    for(; f.get(wch); ++pos)
        assert(wstr[pos] == wch);
    assert(pos == wstr.size()); // ***PROBLEM HERE***
}
int main()
{
    MakeTestStr();
    Write();
    Read();
}