From: Kirit Sælensminde (kirit.saelensminde_at_[hidden])
Date: 2008-03-02 21:31:14
Phil Endecott wrote:
> Sebastian Redl wrote:
>> It gets worse. I've tried to implement a very simple "kinda-shift"
>> encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to
>> determine endianness. This encoding uses the shift state to remember
>> what endian it is in. (No dynamic switching.)
> The common case is that you have a BOM at the start, and if there are
> any other BOMs they'll be the same. But what I don't know is what the
> Unicode specs allow in this respect, and whether it's sensible to
> provide explicit support for that limited case as well as the more
> general case.
From memory, when I was implementing Unicode strings for my web
framework, the rules went something along these lines.
If an enclosing specification already tells us that the content is
Unicode and which encoding it uses (e.g. HTTP and SMTP/MIME have this
mechanism) then there shouldn't be a BOM. There should also never be a
BOM anywhere other than the start of a string/stream/file (if you
concatenate, you should remove the inner ones). I think some old
applications may also incorrectly use a BOM as a zero-width no-break
space, which is what U+FEFF originally meant. You probably want to just
filter out all BOMs and emit them in streams etc. only when told to do so.
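The filtering could be sketched like this (a hypothetical helper, not
from any particular library, working on UTF-8 input where the BOM is the
byte sequence EF BB BF):

```cpp
#include <string>

// Remove every UTF-8 encoded BOM (EF BB BF), whether it appears at the
// start of the string or, incorrectly, somewhere in the middle.
std::string strip_boms(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ) {
        if (i + 2 < in.size() &&
                static_cast<unsigned char>(in[i])     == 0xEF &&
                static_cast<unsigned char>(in[i + 1]) == 0xBB &&
                static_cast<unsigned char>(in[i + 2]) == 0xBF) {
            i += 3;  // skip the three BOM bytes
        } else {
            out += in[i++];
        }
    }
    return out;
}
```

A caller would then re-attach a single BOM on output only when the
destination actually calls for one.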
When decoding UTF-8 it is also useful to check that the character you
just decoded is actually meant to use that number of UTF-8 bytes. For
example, by zero padding you can encode an apostrophe in 2 bytes rather
than 1. There are a number of security exploits centred around such
overlong sequences, and encountering one means you're dealing with a
buggy Unicode encoder at best, but more likely your software is under
attack. I throw an exception to stop all processing in its tracks if I
see this.
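The check itself is a small one. A sketch (my own illustration, assuming
the decoder has already established how many bytes the sequence used and
what code point it produced):

```cpp
#include <stdexcept>

// After decoding a code point from `len` UTF-8 bytes, verify that `len`
// is the minimum number of bytes that code point can be encoded in.
// Anything longer is an overlong sequence and must be rejected.
void check_not_overlong(unsigned long code_point, int len) {
    const int min_len =
        code_point < 0x80    ? 1 :
        code_point < 0x800   ? 2 :
        code_point < 0x10000 ? 3 : 4;
    if (len != min_len)
        throw std::runtime_error("overlong UTF-8 sequence");
}
```

For the apostrophe example: U+0027 decoded from the two bytes C0 A7
would arrive here as `check_not_overlong(0x27, 2)` and throw, while the
legitimate one-byte form passes.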
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk