Subject: [boost] [filesystem and beyond] Narrow strings be UTF-8
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-10-27 15:31:19


On Wed, Oct 26, 2011 at 22:13, Beman Dawes <bdawes_at_[hidden]> wrote:

> (2) V3 may work OK with the Microsoft 65001 UTF-8 codepage, although
> I've never used it myself and you would have to pass in a UTF-8
> encoded narrow character name.
>

If that were possible, it would simplify everything. Unfortunately, you
cannot set the UTF-8 codepage for Windows API functions (you can for the
console, though). Microsoft isn't interested in making portability easier, so
I don't see them adding support in the near future. But they could. UTF-8
should be considered the only default narrow encoding on Windows, because
none of the other ANSI encodings is Unicode-aware.

On Wed, Oct 26, 2011 at 22:47, Beman Dawes <bdawes_at_[hidden]> wrote:

> On Wed, Oct 26, 2011 at 6:24 AM, Yakov Galka <ybungalobill_at_[hidden]>
> wrote:
>
> [...]
>
> Even if you fix the Unicode problems,
>
> What Unicode problems are you running into? Although there are some
> locale related tickets outstanding, I'm not aware of any Unicode
> issues.
>

1) The one that was brought up in the previous thread.
2) The complexity of writing portable Unicode-aware code: currently you're
forcing me to either
    a) use wstring on Windows, or,
    b) if I prefer my favorite portable UTF-8 encoded strings, write all the
boilerplate code that passes a codecvt everywhere as a parameter (see below
for why not imbue()).

In both cases you're shifting the complexity to the higher-level code. That's
not a kind thing for you as a low-level library developer to do. The library
is expected to hide the platform differences by providing a uniform
interface.

⇒ Myth: Using the native encoding on each platform results in portable code.
‽ Under some definition of 'portable', definitely yes. But not when things are
shared among different platforms. It starts with files transferred between
different systems and ends with the source code itself (there is a difference
between "" and L""). Uniformity == simplicity.

Consider a simple case: reading a path from some project file and then opening
the file it refers to. The project file is encoded in UTF-8, making it portable
among all systems with CHAR_BIT == 8.

// Option a)
#include "codecvt_implementation.h"
std::basic_ifstream<native_char> fin("project.file");
std::basic_string<native_char> str;
fin.imbue(std::locale(fin.getloc(), new utf8_to_native_codecvt()));
getline(fin, str);
fs::ifstream fin2(project_path/str);

// Option b)
#include "codecvt_implementation.h"
std::ifstream fin("project.file");
std::string str;
getline(fin, str);
fs::ifstream fin2(fs::path(project_path).append(str,
utf8_to_native_codecvt()));

// Option c) How it could be done
std::ifstream fin("project.file");
std::string str;
getline(fin, str);
fs::ifstream fin2(project_path/str);
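
For option c) to work on Windows, the library itself would have to widen the
UTF-8 narrow string to UTF-16 before calling the wide API. A minimal sketch of
that conversion in standard C++ follows. The name utf8_to_utf16 is
hypothetical and not part of Boost.Filesystem, and the sketch assumes mostly
well-formed input: it rejects bad lead and continuation bytes but does not
reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Boost API): decode UTF-8 bytes into UTF-16
// code units, the form the wide Windows API functions expect.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::size_t len = 0;    // total bytes in this sequence
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With something like this behind the narrow-string constructors, the caller's
code stays identical on every platform, which is the whole point of option c).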

⇒ Use boost::filesystem::imbue to convert b to c.
‽ Who is responsible for calling imbue()? I'm writing library code. I'm not
allowed to change the global state.

⇒ This code will break:
int main(int argc, char* argv[]) {
    fs::ifstream fin(argv[1]);
}
‽ It works fine for ASCII characters on all sane platforms. For non-ASCII, I
don't care. It's already not Unicode-aware if the native encoding is not
UTF-8 (and it can't be UTF-8 on Windows). If the writer of this code really
cares about internationalization, she can use boost::program_options
(assuming it's also changed to follow the UTF-8 convention). Otherwise she's
a hypocrite.

⇒ UTF-8 is slow.
‽ Compared to what? You haven't measured this.
On Windows: we must do complicated operations on paths before we pass them
to the OS anyway (system_complete, prepend \\?\). Most of the strings
contain only ASCII characters, so std::strings take less memory, thus
decreasing cache thrashing in other parts of the program.
On other OSes: the encoding is already almost always UTF-8.
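
The memory claim is easy to check. The sketch below compares the character
payload of the same ASCII path held as std::string and as std::wstring. The
two function names are made up for illustration, and the figures ignore
allocator overhead and any small-string optimization.

```cpp
#include <cstddef>
#include <string>

// Bytes occupied by the characters of a narrow (e.g. UTF-8) string.
std::size_t narrow_payload_bytes(const std::string& s) {
    return s.size() * sizeof(char);
}

// Bytes occupied by the characters of a wide string; sizeof(wchar_t) is
// typically 2 on Windows and 4 on most Unix-like systems.
std::size_t wide_payload_bytes(const std::wstring& s) {
    return s.size() * sizeof(wchar_t);
}
```

For an ASCII-only path the wide form is sizeof(wchar_t) times larger than the
narrow one.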

Experience shows that the small overhead (if it's an overhead at all) is not
the bottleneck. Many cross-platform libraries have already switched to UTF-8
for narrow chars (see one of the previous discussions for a list), and I don't
see a reason why Boost can't be the next.

-- 
Yakov

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk