Subject: Re: [boost] [filesystem and beyond] Narrow strings be UTF-8
From: Beman Dawes (bdawes_at_[hidden])
Date: 2011-10-28 09:53:08

On Thu, Oct 27, 2011 at 3:31 PM, Yakov Galka <ybungalobill_at_[hidden]> wrote:
> On Wed, Oct 26, 2011 at 22:13, Beman Dawes <bdawes_at_[hidden]> wrote:
>> On Wed, Oct 26, 2011 at 6:24 AM, Yakov Galka <ybungalobill_at_[hidden]>
>> wrote:
>>> [...]
>>> Even if you fix the Unicode problems,
>> What Unicode problems are you running into? Although there are some
>> locale related tickets outstanding, I'm not aware of any Unicode
>> issues.
> 1) The one that was brought up in the previous thread.

That resulted in a ticket being opened, to fix a problem specific to
MinGW. It will get fixed as time permits.

> 2) The complexity of writing portable unicode-aware code: currently you're
> forcing me to
>    a) use wstring on windows, or if I prefer to use my favorite portable
> UTF-8 encoded strings
>    b) write all the boilerplate code that passes codecvt everywhere as a
> parameter (see below why ¬imbue()).
> In both cases you're shifting the complexity to the higher-level code.
> It's
> not a kind thing for you as a low-level library developer to do,

I don't know of any other viable approaches. I'm sorry you find the
boilerplate objectionable, but I'm not about to change to a default
that would enforce UTF-8 or any other particular narrow string
encoding. That's a much wider problem than Boost.Filesystem.
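
For readers who have not seen the boilerplate under discussion, here is
a minimal sketch of what "passing codecvt everywhere as a parameter"
amounts to. It uses only the standard library: to_wide() is a
hypothetical helper, not a Boost.Filesystem function, and the
codecvt_type alias is declared locally for illustration. It assumes the
narrow string is UTF-8 and that the caller supplies a matching facet
such as std::codecvt_utf8<wchar_t> (deprecated since C++17, though
still commonly shipped).

```cpp
#include <cassert>
#include <codecvt>    // std::codecvt_utf8 (deprecated since C++17)
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Local alias for illustration: the facet type that converts between
// narrow (char) and wide (wchar_t) strings.
using codecvt_type = std::codecvt<wchar_t, char, std::mbstate_t>;

// Hypothetical helper: decode `narrow` using the facet the caller passes in,
// rather than relying on any global or imbued locale.
std::wstring to_wide(const std::string& narrow, const codecvt_type& cvt) {
    // Assumes each wide unit consumes at least one input byte (true for
    // UTF-8), so the input length is an upper bound on the output length.
    std::wstring wide(narrow.size(), L'\0');
    if (narrow.empty()) {
        return wide;
    }

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    const codecvt_type::result r =
        cvt.in(state,
               narrow.data(), narrow.data() + narrow.size(), from_next,
               &wide[0], &wide[0] + wide.size(), to_next);
    if (r != codecvt_type::ok) {
        throw std::runtime_error("narrow string is not valid in this encoding");
    }

    // A successful conversion has consumed the whole input.
    assert(from_next == narrow.data() + narrow.size());
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}
```

A call then looks like to_wide(name, std::codecvt_utf8<wchar_t>()), and
every layer between the user's string and the conversion has to carry
the facet along. That is the cost being objected to above; the benefit
is that no global state is touched.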

> The library
> is expected to ℍ𝕚𝕕𝕖 the platform differences by providing a uniform
> interface.

Initially the plan was to provide a uniform interface in terms of both
syntax and semantics. User reaction to uniform syntax was very
positive, and I've tried to provide that to the maximum extent
possible as far as the API goes.

Uniform semantics turned out to be much more complex. Paths are one of
the areas where acknowledging the difference between generic paths and
native paths is something that users want and need.

> ⇒ Myth: Using the native encoding on each platform results in portable code.

Hum... I don't recall anyone ever claiming that "native encoding on
each platform results in portable code". It is way more complex than
that.

> ⇒ Use boost⸬filesystem⸬imbue to convert b to c.
> ‽ Who is responsible for calling imbue()? I'm writing library code. I'm not
> allowed to change the global-state.

Right. Library code writers have to avoid changing global state if
they want to keep users happy. Nothing unusual about Boost.Filesystem
in that respect.

> ⇒ This code will break:
> int main(int argc, char* argv[]) {
>    fs::ifstream fin(argv[1]);
> }
> ‽ It works fine for ASCII characters on all sane platform. For non-ASCII, I
> don't care. It's already not unicode-aware if the native encoding is not
> UTF-8 (which can't be so on Windows). If the writer of this code really
> cares about internationalization, she can use boost⸬program_options
> (assuming it's also changed to follow the UTF-8 convention). Otherwise she's
> a hypocrite.

I disagree with your assertion that the above code will break.

It is in all essential aspects the same as:

int main(int argc, char* argv[]) {
    std::ifstream fin(argv[1]);
}

While the results may not be what the coder expected, that's an issue
well beyond the scope of the standard library or Boost.Filesystem.

> ⇒ UTF-8 is slow.
> ‽ Compared to what? You haven't measured this.

Actually, I have measured it many times, and never found UTF-8 to be a
bottleneck on European, North American, or South American data sets. I
haven't measured UTF-8 with Asian data sets; they tended to use other
encodings.

> Experience shows that the small overhead (if it's an overhead at all) is not
> the bottleneck. Many cross-platform libraries already switched to UTF-8 for
> narrow-chars (see one of the previous discussion for a list), and I don't
> see a reason why boost can't be the next.

Boost libraries could use UTF-8 for narrow characters as a default,
but only where they aren't interfacing with existing code and/or
operating systems.

