From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2008-03-03 05:49:05
Kirit Sælensminde wrote:
> If an enclosing specification already tells us that it is Unicode and
> which encoding (i.e. HTTP and SMTP/MIME have this mechanism) then there
> shouldn't be a BOM.
Yes, if the mechanism tells us the endianness. Otherwise, the BOM is
still needed.
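
For the "otherwise" case, a rough sketch of sniffing the byte order from the data itself. The names here (Bom, BomResult, detect_bom, skip_bom) are made up for illustration and are not from any library; the sketch assumes C++17 for std::string_view. Note that FF FE 00 00 is ambiguous: it could also be a UTF-16LE BOM followed by U+0000, and this sketch simply prefers the UTF-32 reading.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Encoding signalled by a leading byte order mark, if any (hypothetical names).
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

struct BomResult {
    Bom kind;            // which signature matched
    std::size_t length;  // number of signature bytes to skip
};

// Looks for an encoded U+FEFF at the very start of `data`.
inline BomResult detect_bom(std::string_view data) {
    using namespace std::string_view_literals;
    struct Sig { std::string_view bytes; Bom kind; };
    // Longer signatures first: FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
    static constexpr Sig sigs[] = {
        {"\x00\x00\xFE\xFF"sv, Bom::Utf32BE},
        {"\xFF\xFE\x00\x00"sv, Bom::Utf32LE},
        {"\xEF\xBB\xBF"sv,     Bom::Utf8},
        {"\xFE\xFF"sv,         Bom::Utf16BE},
        {"\xFF\xFE"sv,         Bom::Utf16LE},
    };
    for (const Sig& s : sigs) {
        // substr clamps the count, so short inputs simply fail to match.
        if (data.substr(0, s.bytes.size()) == s.bytes) return {s.kind, s.bytes.size()};
    }
    return {Bom::None, 0};
}

// Returns `data` with any leading signature removed.
inline std::string_view skip_bom(std::string_view data) {
    const BomResult r = detect_bom(data);
    assert(r.length <= data.size());  // holds because the signature matched a prefix
    data.remove_prefix(r.length);
    return data;
}
```

If the enclosing protocol already names the byte order (for example a "UTF-16BE" label), a caller would skip this step and go straight to decoding.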
> There also should never be a BOM anywhere other than
> the start of a string/stream/file (if you concatenate you should remove
> inner ones). I think some old applications may incorrectly use a BOM as
> a zero width break too.
Not really incorrectly. U+FEFF really was the zero-width non-breaking
space originally, but the special zero-width property led people to use
it as a BOM. Thus, a different character (U+2060 WORD JOINER) was
designated as the new ZWNBSP, and U+FEFF was officially made the BOM
(U+FFFE is only what you see when the byte order is swapped). So the
usage is only incorrect in new applications.
> When decoding UTF-8 it is also useful to check that the character you
> just decoded is actually meant to use that number of UTF-8 bytes. For
> example, by zero padding you can encode an apostrophe as 2 bytes rather
> than 1. There are a number of security exploits centred around this and
> getting one means you're dealing with a buggy Unicode encoder at best,
> but more likely your software is under attack. I throw an exception to
> stop all processing in its tracks if I see this.
>
Phil's code does that, too.
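
To make the check concrete, here is a minimal sketch of a decoder that rejects overlong forms. It is not Phil's code; decode_utf8 and its signature are assumptions made up for this example, and it assumes C++17 for std::string_view. The idea is to remember how many bytes the lead byte announced and then compare the decoded value against the smallest code point that actually needs that many bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Decodes one code point from `in` starting at `pos` and advances `pos`.
// Throws std::runtime_error on malformed input, including overlong forms.
// On failure `pos` is left unchanged.
inline char32_t decode_utf8(std::string_view in, std::size_t& pos) {
    assert(pos < in.size());  // precondition: at least one byte left
    const auto byte = [&](std::size_t i) { return static_cast<unsigned char>(in[i]); };

    const unsigned char lead = byte(pos);
    std::size_t len;
    char32_t cp;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (in.size() - pos < len) throw std::runtime_error("truncated UTF-8 sequence");
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = byte(pos + i);
        if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }

    // Smallest code point that legitimately needs `len` bytes.
    static constexpr char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len]) throw std::runtime_error("overlong UTF-8 encoding");
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("code point outside Unicode scalar range");

    pos += len;
    return cp;
}
```

With this sketch the zero-padded apostrophe from the example above, the bytes C0 A7, decodes to 0x27 in two bytes, falls below the 0x80 minimum for a two-byte sequence, and is rejected with an exception, while the plain one-byte 0x27 is accepted.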
Sebastian Redl
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk