Subject: Re: [Boost-users] boost::filesystem::path in UTF-8 on Windows
From: Andrey Moshbear (andrey.vul_at_[hidden])
Date: 2011-11-05 14:00:30


On Sat, Nov 5, 2011 at 12:43, John M. Dlugosz <mpbecey7gu_at_[hidden]> wrote:
> On Fri, Nov 4, 2011 at 11:28, Igor R <boost.lists_at_[hidden]> wrote:
>>>>
>>>> On Windows you should convert it to utf16.
>
> I know that is how it stores it internally.
> My question is "how".  Given that I have data that are file names and
> encoded in UTF-8, how do I make the Boost path class accept them, and
> operate conveniently enough to be worth using instead of plain strings?
>
> On Fri, Nov 4, 2011 at 22:54, Andrey Moshbear <andrey.vul_at_[hidden]> wrote:
>>
>> For my rewrite of UTF-8 to UTF-16/32, look at
>> https://github.com/moshbear/fastcgipp/blob/master/src/utf8_cvt.cpp.
>
> So this is a codecvt that I should use as the extra argument, that works
> better than the undocumented one that came with Boost?
>

And the Boost utf8<->utf32 one is in fact documented:
http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/codecvt.html
It just won't work correctly for code points outside the BMP (above
U+FFFF) if the wide char type is 16 bits, as wchar_t is on Windows.

The code itself isn't that self-documenting, though, which makes
hacking in the U+10FFFF limit and surrogate pair parsing more
work than simply rewriting the codecvt.
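
For reference, here is a minimal sketch of the two pieces mentioned above,
the U+10FFFF limit check and surrogate pair encoding. The name
append_utf16 is made up for illustration; it is not taken from Boost or
from the utf8_cvt.cpp linked above, and it only covers the UTF-16 output
side for a code point that has already been decoded.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): append one Unicode code point
// to a UTF-16 string. Returns false, leaving `out` untouched, if `cp` is
// above U+10FFFF or falls inside the surrogate range U+D800..U+DFFF.
bool append_utf16(char32_t cp, std::u16string& out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (cp < 0x10000) {
        // BMP code point: one 16-bit unit, stored as-is.
        out.push_back(static_cast<char16_t>(cp));
        return true;
    }

    // Supplementary plane: split the 20-bit offset into two 10-bit halves.
    cp -= 0x10000;
    const char16_t hi = static_cast<char16_t>(0xD800 + (cp >> 10));
    const char16_t lo = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    assert(hi <= 0xDBFF && lo <= 0xDFFF);
    out.push_back(hi);
    out.push_back(lo);
    return true;
}
```

A codecvt whose output type is a 16-bit wchar_t would need this logic in
its do_in(), plus the reverse step in do_out().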

>
> And, the implicit answer is that this is indeed how I do it?
>
> But:
>
> 1) When I write something like
>   path p2= p1 / "Foo" / s1 / name;
> there is no place to pass the extra codecvt argument.  I thought it might
> take strings and keep the existing encoding, but it actually uses the
> default code page.  How can I use path in a simple and convenient manner
> given that in this program all the strings I will use with it are already in
> UTF-8?
>

Make a std::wstringstream.
Imbue it with locale(locale::classic(), new Utf8_cvt).
Use operator<< to build up a path.
Call .str() to get the string.
Pass that to the path constructor.
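
A rough sketch of those steps follows. One deviation, stated plainly: as
far as I know a std::wstringstream does not run the codecvt of its locale
when you insert narrow strings (only file streams consult it), so the
sketch converts each piece explicitly before inserting it. The names
widen_utf8 and join_utf8 are hypothetical, and the decoder is deliberately
simple: it replaces malformed sequences with U+FFFD and does not reject
overlong encodings.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 into wchar_t. Emits surrogate pairs
// when wchar_t is 16 bits (Windows), plain code points otherwise.
std::wstring widen_utf8(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char b = static_cast<unsigned char>(in[i++]);
        unsigned long cp;
        int extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (cp > 0x10FFFF)
            cp = 0xFFFD;
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}

// Build one wide path string from UTF-8 pieces using operator<<.
std::wstring join_utf8(const std::vector<std::string>& parts,
                       wchar_t sep = L'/')
{
    std::wostringstream ws;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0)
            ws << sep;
        ws << widen_utf8(parts[i]);
    }
    return ws.str();
}
```

The result of join_utf8() is what you would hand to the path constructor.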

> 2) How can I write a line like:
>   path p2 (somestring, codecvt());
> in a portable manner?  On the Mac the internal representation is char, so
> will it object to having the codecvt passed?  Once I set things up, I want
> the bulk of the source code to be the same on all platforms, so writing the
> argument on Windows and leaving it out on Mac is not acceptable.
>

Since the Mac APIs take char strings, wide-character UTF isn't going to
work there: the libraries look for a char 0 as the terminator,
not a wchar_t 0.

The best solution is to wrap the UTF-8 to UTF-16 conversion in
#ifdef _WIN32, so it is only compiled on Windows.
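
Something along these lines, as a sketch only. native_string and
to_native are names invented here, and the non-Windows branch assumes the
platform takes UTF-8 in char strings, so it passes the bytes through
unchanged. The Windows branch uses the Win32 MultiByteToWideChar call
rather than a codecvt, purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#ifdef _WIN32
#  include <windows.h>

typedef std::wstring native_string;  // hypothetical name

// Convert UTF-8 to UTF-16. Returns an empty string if the conversion
// fails (MultiByteToWideChar reports failure by returning 0).
native_string to_native(const std::string& utf8)
{
    if (utf8.empty())
        return native_string();
    const int len = static_cast<int>(utf8.size());
    const int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len, NULL, 0);
    assert(n >= 0);
    native_string out(static_cast<std::size_t>(n), L'\0');
    if (n > 0)
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len, &out[0], n);
    return out;
}
#else

typedef std::string native_string;   // hypothetical name

// Assumption: the platform already takes UTF-8 char strings.
native_string to_native(const std::string& utf8)
{
    return utf8;
}
#endif
```

The rest of the source then stays the same on every platform, e.g.
path p2(to_native(somestring)); with the #ifdef confined to one place.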

