Subject: Re: [boost] Change to guidelines for characters in C++ source files
From: Gavin Lambert (gavinl_at_[hidden])
Date: 2015-06-29 03:48:01


On 26/06/2015 11:33, Mateusz Loskot wrote:
> On 26 June 2015 at 01:15, Andrey Semashev <andrey.semashev_at_[hidden]> wrote:
>> Why not just always assume UTF-8, whether there is BOM or not? I don't think
>> UTF-8 BOM makes much sense, and I don't think editors commonly insert one.
>
> Also, http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf says:
>
> "Use of a BOM is neither required nor recommended for UTF-8,
> but may be encountered in contexts where UTF-8 data is converted from
> other encoding forms that use a BOM or where the BOM is used as a
> UTF-8 signature. "

That last part is relevant here -- a UTF-8 BOM at the start of the file
is used as a signature that the file contains UTF-8 content. In the
absence of that signature, any reader of the file must guess at the
content encoding.
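
To make the signature check concrete, here is a minimal sketch in C++.
The names has_utf8_bom, detected_encoding and detect_encoding are made
up for this example; they are not from Boost or from any proposed
guideline. It only looks at the first three bytes (EF BB BF) and
reports "unknown" when they are absent, leaving the guess to the caller.

```cpp
#include <cstddef>
#include <string>

// True if the buffer starts with the UTF-8 signature bytes EF BB BF.
bool has_utf8_bom(const std::string& bytes)
{
    return bytes.size() >= 3
        && static_cast<unsigned char>(bytes[0]) == 0xEF
        && static_cast<unsigned char>(bytes[1]) == 0xBB
        && static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical result type for this sketch: either the file declared
// itself as UTF-8, or the reader has nothing to go on.
enum class detected_encoding { utf8, unknown };

detected_encoding detect_encoding(const std::string& bytes)
{
    // With a signature there is nothing to guess; without one the
    // caller has to pick a default (system locale, UTF-8, ...).
    return has_utf8_bom(bytes) ? detected_encoding::utf8
                               : detected_encoding::unknown;
}
```

A file that begins with EF BB BF comes back as utf8; anything else,
including a file too short to hold the signature, comes back as unknown.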

The problem is that various parties are inconsistent about what a file
without a BOM actually means, but most commonly it means that the file
is assumed to be in the default system locale. On modern Linux, that
usually means UTF-8 anyway, but that is not universal, and it is never
the case on Windows. There, the file will be interpreted in whatever
the user's chosen "language for non-Unicode programs" is, which will
vary depending on the user's country, preferred languages, and whether
they've been playing Japanese novel games recently or not. As such it
is vastly safer to include the BOM than to omit it. (One exception
might be for shell scripts and other text-like files that care about
their first few bytes and aren't expecting BOMs.)

In some cases the reader is expected to try to parse the file as UTF-8
and then fall back to some other encoding if an invalid UTF-8 byte
sequence is encountered. This is quite aggravating both for the people
expected to write such software and also for the users who get their
text misinterpreted by such heuristics, and whoever suggests that this
is a sensible choice for a default action should get thwapped upside
the head. (As an explicit "try to recover unknown format document"
option, sure. But not a default.)
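
For illustration, a rough sketch of what that heuristic involves, again
in C++. The names is_valid_utf8, decode_result and classify are
hypothetical, and the allow_fallback flag is my own way of modelling
the "explicit option, not a default" point; no existing tool is claimed
to work this way. The validator rejects stray continuation bytes,
truncated sequences, overlong forms, surrogates, and values above
U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Returns true if every byte of `s` belongs to a well-formed UTF-8 sequence.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }            // ASCII

        std::size_t len = 0;       // total bytes in this sequence
        unsigned long cp = 0;      // code point being assembled
        unsigned long min_cp = 0;  // smallest value allowed for this length
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0)      { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation, 0xC0/0xC1, or 0xF5..0xFF

        if (n - i < len) return false;               // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

enum class decode_result { utf8, fallback, rejected };

// The fallback is opt-in: by default, invalid input is rejected rather
// than silently reinterpreted in some other encoding.
decode_result classify(const std::string& bytes, bool allow_fallback = false)
{
    if (is_valid_utf8(bytes)) return decode_result::utf8;
    return allow_fallback ? decode_result::fallback : decode_result::rejected;
}
```

Note that a short Latin-1 string such as "caf" followed by byte E9 fails
validation and is rejected unless the caller asks for the fallback, while
plain ASCII passes as UTF-8 whatever the author intended.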

If you're looking for authority, you might want to read
http://unicode.org/faq/utf_bom.html#BOM as well. The key point is that
the recommendation not to use BOMs applies to situations in which the
encoding is already known in advance (such as databases, or protocols
that explicitly transmit an encoding in an envelope). Files are not an
example of that.

