From: Tilman Kuepper (kuepper_at_[hidden])
Date: 2004-08-02 09:33:16

Hello world,

I took a closer look at the UTF-8 codecvt facet which is part
of the program_options library. A test program is attached.

The last assert (in the Read() function) fails with g++ (GCC)
3.3.3 (Debian 20040429).

After some debugging I think I found the problem: the function
utf8_codecvt_facet_wchar_t::do_in() converts only valid (complete)
UTF-8 sequences into internal (wchar_t) characters. If the input
buffer ends with an incomplete UTF-8 sequence, do_in() returns
codecvt_base::partial and points from_next at the beginning of
that incomplete sequence.
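
To make the stopping point concrete, here is a minimal sketch (not
the facet's actual code) of how the end of the last complete UTF-8
sequence in a buffer could be located. The helper names
utf8_sequence_length and complete_prefix_length are made up for
this illustration, and only lead bytes are inspected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// `lead`, or 0 if `lead` cannot start a sequence (e.g. a continuation byte).
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: number of leading bytes of `buf` that consist only of
// complete sequences. Continuation bytes are not validated in this sketch.
inline std::size_t complete_prefix_length(const std::string& buf)
{
    std::size_t pos = 0;
    while (pos < buf.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(buf[pos]));
        if (len == 0 || pos + len > buf.size())
            break;  // invalid lead byte or truncated sequence: stop here
        pos += len;
    }
    return pos;
}
```

For a buffer holding 'a', a two-byte character and the first two
bytes of a three-byte character, the result is 3. That is the offset
from_next would correspond to in the situation described above.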

Apparently the library (libstdc++) does not expect the codecvt
facet to stop translating while there is still room in the output
buffer (i.e. to_next != to_end) and unprocessed input remains
(from_next != from_end).
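
As a sketch of what a caller can do with such a partial result, the
following class keeps the unconsumed tail and prepends it to the
next chunk of input. This is an assumption about one reasonable
strategy, not a description of what libstdc++ or Dinkumware do
internally. ChunkedUtf8Decoder is a hypothetical name, the code
assumes C++11 (std::u32string), and malformed input is not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical class: decodes UTF-8 that arrives in arbitrary chunks and
// carries an incomplete trailing sequence over to the next call.
class ChunkedUtf8Decoder
{
public:
    // Decodes every complete sequence in (pending bytes + chunk).
    // Bytes of a trailing incomplete sequence stay buffered.
    std::u32string feed(const std::string& chunk)
    {
        pending_ += chunk;
        std::u32string out;
        std::size_t pos = 0;
        while (pos < pending_.size()) {
            const unsigned char lead =
                static_cast<unsigned char>(pending_[pos]);
            std::size_t len = 0;
            char32_t cp = 0;
            if (lead < 0x80)                { len = 1; cp = lead; }
            else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
            else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
            else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
            else break;  // invalid lead byte: this sketch simply stops
            if (pos + len > pending_.size())
                break;   // incomplete sequence: wait for more input
            for (std::size_t i = 1; i < len; ++i)
                cp = (cp << 6) |
                     (static_cast<unsigned char>(pending_[pos + i]) & 0x3F);
            out += cp;
            pos += len;
        }
        pending_.erase(0, pos);  // keep only the unconsumed tail
        return out;
    }

    // Number of buffered bytes still waiting for the rest of their sequence.
    std::size_t pending() const { return pending_.size(); }

private:
    std::string pending_;
};
```

Feeding 'a' plus the first two bytes of U+20AC yields one character
and leaves two bytes pending. Feeding the third byte afterwards
yields U+20AC and empties the buffer.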

As a consequence the for-loop in the test program stops too
early (the wifstream is no longer "good"), and
assert(pos == wstr.size()) fails.

Is this a known issue with the GNU library or with the UTF-8
conversion facet? And what can be done?

Best regards from Aachen,

PS: You can find the codecvt facet cpp/hpp files in the
folders /boost/boost/program_options/detail/ and in

PPS: There seems to be no problem with VC7.1/Dinkumware.


#include "utf8_codecvt_facet.hpp"

#include <string>
#include <iostream>
#include <fstream>
#include <locale>
#include <cassert>
#include <cwchar>   // WCHAR_MAX

using namespace std;
using namespace boost;
using namespace boost::program_options;
using namespace boost::program_options::detail;

namespace { wstring wstr; }

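// Returns false for surrogates (U+D800..U+DFFF), for U+FFFE/U+FFFF
// and for values beyond U+10FFFF.
inline bool IsUnicode(wchar_t wch)
{
    if(wch >= 0x00D800 && wch < 0x00E000) return false;
    if(wch >= 0x00FFFE && wch < 0x010000) return false;
    if(wch >= 0x110000) return false;
    return true;
}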

// Fills wstr with every value below WCHAR_MAX accepted by IsUnicode().
inline void MakeTestStr()
{
    const size_t loops = 1;

    wstr.reserve(loops * 0x110000);
    for(size_t i = 0; i < loops; ++i)
        for(wchar_t wch = 0; wch < WCHAR_MAX; ++wch)
            if(IsUnicode(wch)) wstr += wch;
}

// Writes wstr to test.utf8 through the UTF-8 codecvt facet.
inline void Write()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wofstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);
    f << wstr;
}

// Reads test.utf8 back and compares it with wstr character by character.
inline void Read()
{
    locale loc;
    locale utf8loc(loc, new utf8_codecvt_facet<wchar_t, char>());
    wifstream f;
    f.imbue(utf8loc);
    f.open("test.utf8", ios::binary);

    wchar_t wch;
    size_t pos = 0;
    for(; f.get(wch); ++pos)
        assert(wstr[pos] == wch);

    assert(pos == wstr.size()); // ***PROBLEM HERE***
}

int main()
{
    MakeTestStr();
    Write();
    Read();
    return 0;
}
