From: Tilman Kuepper (kuepper_at_[hidden])
Date: 2004-08-02 09:33:16


Hello world,

I took a closer look at the UTF-8 codecvt facet which is part
of the program_options library. A test program is attached.

The last assert (in the Read function) fails with g++ (GCC)
3.3.3 (Debian 20040429).

After some debugging I think I found the problem: The function
utf8_codecvt_facet_wchar_t::do_in() converts only valid
(complete) UTF-8 sequences into internal (wchar_t) characters.
If the input buffer ends with an incomplete UTF-8 character,
do_in() returns codecvt_base::partial and points from_next at
the beginning of this incomplete UTF-8 sequence.
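
To make the behaviour described above concrete, here is a
minimal stand-alone sketch. It is not the facet's code. The
names Status, Decoded, SequenceLength and DecodeCompletePrefix
are made up for illustration, and validation is deliberately
minimal (no overlong or surrogate checks). It converts every
complete sequence and stops at the first byte of an incomplete
trailing one, with "consumed" playing the role of from_next.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type, loosely modelled on codecvt_base::result.
enum class Status { ok, partial, error };

struct Decoded {
    Status status;         // ok, partial (input ends mid-sequence) or error
    std::size_t consumed;  // bytes converted; plays the role of from_next
    std::u32string out;    // decoded code points
};

// Number of bytes in the sequence started by `lead`, or 0 if `lead`
// cannot start a sequence.
inline std::size_t SequenceLength(unsigned char lead)
{
    if(lead < 0x80) return 1;
    if(lead >= 0xC2 && lead < 0xE0) return 2;
    if(lead >= 0xE0 && lead < 0xF0) return 3;
    if(lead >= 0xF0 && lead < 0xF5) return 4;
    return 0;
}

// Converts all complete sequences at the front of `in` and stops at
// the first incomplete one, reporting Status::partial.
inline Decoded DecodeCompletePrefix(const std::string& in)
{
    Decoded r{Status::ok, 0, {}};
    while(r.consumed < in.size()) {
        const unsigned char lead =
            static_cast<unsigned char>(in[r.consumed]);
        const std::size_t len = SequenceLength(lead);
        if(len == 0) { r.status = Status::error; return r; }
        if(in.size() - r.consumed < len) {
            // Incomplete trailing sequence: `consumed` stays at its
            // first byte, although input remains and output has room.
            r.status = Status::partial;
            return r;
        }
        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        char32_t cp = static_cast<char32_t>(
            len == 1 ? lead : (lead & (0xFF >> (len + 1))));
        for(std::size_t i = 1; i < len; ++i) {
            const unsigned char c =
                static_cast<unsigned char>(in[r.consumed + i]);
            if((c & 0xC0) != 0x80) { r.status = Status::error; return r; }
            cp = (cp << 6) | (c & 0x3F);
        }
        r.out += cp;
        r.consumed += len;
    }
    return r;
}
```

For an input such as "A", a complete three-byte sequence and
the first two bytes of another, the sketch reports partial
with consumed == 4, which is the situation the file buffer
has to cope with.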

Apparently the library (libstdc++) does not expect the codecvt
facet to stop the translation while there is still room in the
output buffer (i.e. to_next != to_end) and not all input
characters have been processed (from_next != from_end).
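
For comparison, this is roughly what I would expect a caller
to do when a conversion stops in the middle of the input: keep
the unconsumed bytes and retry once more input has arrived.
The sketch below is not how libstdc++ is implemented, and
CompletePrefixLength and FeedChunks are hypothetical helper
names. It only looks at lead bytes and does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest prefix of `buf` that ends on a UTF-8
// sequence boundary. Only lead bytes are inspected.
inline std::size_t CompletePrefixLength(const std::string& buf)
{
    std::size_t pos = 0;
    while(pos < buf.size()) {
        const unsigned char lead = static_cast<unsigned char>(buf[pos]);
        std::size_t len = 1;
        if(lead >= 0xF0) len = 4;
        else if(lead >= 0xE0) len = 3;
        else if(lead >= 0xC0) len = 2;
        if(buf.size() - pos < len) break;  // incomplete tail
        pos += len;
    }
    return pos;
}

// Feeds chunks (as they might arrive from successive file reads)
// through the boundary check and returns, per chunk, the bytes that
// could be handed on for conversion.
inline std::vector<std::string>
FeedChunks(const std::vector<std::string>& chunks)
{
    std::vector<std::string> converted;
    std::string pending;  // unconsumed tail carried over between reads
    for(const std::string& chunk : chunks) {
        pending += chunk;
        const std::size_t n = CompletePrefixLength(pending);
        converted.push_back(pending.substr(0, n));
        pending.erase(0, n);
    }
    return converted;
}
```

With a three-byte character split across two chunks, the first
call hands on only the bytes before it, and the second call
hands on the whole character together with whatever follows.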

As a consequence the for-loop in the test program stops too
early (wifstream not "good" any longer) and assert(pos ==
wstr.size()) fails.

Is this a known issue with the GNU library or with the UTF-8
conversion facet? And what can be done?

Best regards from Aachen,
Tilman

PS: You can find the codecvt facet cpp/hpp files in the
folders /boost/boost/program_options/detail/ and in
/boost/libs/program_options/src/

PPS: There seems to be no problem with VC7.1/Dinkumware.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#include "utf8_codecvt_facet.hpp"

#include <string>
#include <iostream>
#include <fstream>
#include <locale>
#include <cassert>
#include <cwchar>   // WCHAR_MAX

using namespace std;
using namespace boost;
using namespace boost::program_options;
using namespace boost::program_options::detail;

namespace { wstring wstr; }

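// Returns false for surrogates (U+D800..U+DFFF), for U+FFFE and
// U+FFFF, and for values beyond U+10FFFF.
inline bool IsUnicode(wchar_t wch)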
{
    if(wch >= 0x00D800 && wch < 0x00E000) return false;
    if(wch >= 0x00FFFE && wch < 0x010000) return false;
    if(wch >= 0x110000) return false;
    return true;
}

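// Fills wstr with every wchar_t value below WCHAR_MAX that
// IsUnicode() accepts, in ascending order.
inline void MakeTestStr()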
{
    const size_t loops = 1;

    wstr.clear();
    wstr.reserve(loops * 0x110000);
    for(size_t i = 0; i < loops; ++i)
        for(wchar_t wch = 0; wch < WCHAR_MAX; ++wch)
            if(IsUnicode(wch)) wstr += wch;
}

// Writes wstr to "test.utf8" through the UTF-8 codecvt facet.
inline void Write()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wofstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    f << wstr;
    assert(f);
}

// Reads "test.utf8" back one character at a time and compares
// each character with wstr.
inline void Read()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wifstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);

    wchar_t wch;
    size_t pos = 0;
    for(; f.get(wch); ++pos)
        assert(wstr[pos] == wch);

    assert(pos == wstr.size()); // ***PROBLEM HERE***
}

int main()
{
    MakeTestStr();
    Write();
    Read();
}

