From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2008-03-02 17:29:07
Phil Endecott wrote:
> The common case is that you have a BOM at the start, and if there are
> any other BOMs they'll be the same. But what I don't know is what the
> Unicode specs allow in this respect, and whether it's sensible to
> provide explicit support for that limited case as well as the more
> general case. (Do the IANA character set names that I'm using as the
> basis for the charset_t enum have any way of distinguishing these
> cases, for example? I think the answer is no.)
>
IANA registers UTF-16BE, UTF-16LE and UTF-16. BE and LE are the
fixed-endian variants. UTF-16 depends on context: if the base unit is a
16-bit entity, UTF-16 is simply endian-agnostic. If it's an 8-bit
entity, I believe UTF-16 expects a BOM; RFC 2781 says to assume
big-endian when there is none.
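
To make the 8-bit case concrete, here is a minimal C++ sketch that looks at the first two bytes of a byte stream labelled UTF-16 and reports the byte order the BOM implies. The names `utf16_byte_order` and `detect_utf16_bom` are hypothetical and not part of any interface discussed in this thread. The sketch also ignores that FF FE is the prefix of the UTF-32LE BOM, and it leaves the no-BOM fallback to the caller.

```cpp
#include <cstddef>

// Hypothetical result type: which byte order the leading BOM implies.
enum class utf16_byte_order { big_endian, little_endian, unknown };

// Inspect the first two bytes of a UTF-16 byte stream.
// U+FEFF serialises as FE FF in big-endian and FF FE in little-endian.
// Returns unknown when the buffer is too short or carries no BOM.
constexpr utf16_byte_order detect_utf16_bom(const unsigned char* bytes,
                                            std::size_t size) {
    return size < 2
               ? utf16_byte_order::unknown
           : (bytes[0] == 0xFE && bytes[1] == 0xFF)
               ? utf16_byte_order::big_endian
           : (bytes[0] == 0xFF && bytes[1] == 0xFE)
               ? utf16_byte_order::little_endian
               : utf16_byte_order::unknown;
}
```

A decoder for the plain "UTF-16" label could call this once at the start of the buffer and then treat the rest of the stream as fixed-endian.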
I don't think flipping endianness in the middle of a string is useful. I
can't imagine what twisted tool would generate such code.
Come to think of it, if I'm not careful, *my* code will generate it,
namely when you concatenate a BE and a LE string. Concatenating shift
encodings is *not* fun. Neither is substringing them.
>> Trying to implement this, I've found that it is apparently logically
>> impossible to provide bidirectional iterators for shift encodings, like
>> ISO 2022-based encodings. These encodings rely on state that can only be
>> known by sequentially scanning the string from front to back.
>>
>
> Yes. You may be able to argue in some cases that you can predict the
> state during backward traversal IF there are no redundant shifts and if
> there are only two states. Again, I don't know whether that's useful
> in practice (and I suspect not).
>
Not really. The only shift encodings that ever found use are those of
the ISO 2022 family, which have two different shift state sets, one
with four and one with three states, for a total of 12 shift states, not
to mention the character set selection capabilities.
Have I mentioned that the complexity of this stuff is absurd?
> We could detect the case when skip_forward_char is not implemented.
>
What I'm currently doing is detecting if state_t is an empty class.
Much, much easier than detecting if a function is implemented or not,
especially if you have a base class that provides a default for the
function.
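
The empty-class check can be sketched roughly as follows. This is only an illustration under assumptions: `utf8_like`, `iso2022_like`, `is_stateless` and `char_iterator_category` are hypothetical names, and `std::is_empty` from C++11 stands in for whatever trait the real code uses (Boost.TypeTraits has an equivalent `is_empty`).

```cpp
#include <iterator>
#include <type_traits>

// Hypothetical stateless encoding: decoding needs no context,
// so state_t carries no data.
struct utf8_like {
    struct state_t {};
};

// Hypothetical shift encoding: the decoder has to remember
// which character set is currently shifted in.
struct iso2022_like {
    struct state_t {
        int active_set;
    };
};

// An encoding is treated as stateless when std::is_empty
// reports its state_t as an empty class.
template <typename Encoding>
struct is_stateless : std::is_empty<typename Encoding::state_t> {};

// Stateless encodings can be walked backwards;
// stateful ones are restricted to forward traversal.
template <typename Encoding>
struct char_iterator_category {
    typedef typename std::conditional<
        is_stateless<Encoding>::value,
        std::bidirectional_iterator_tag,
        std::forward_iterator_tag>::type type;
};

static_assert(is_stateless<utf8_like>::value,
              "an empty state_t marks a stateless encoding");
static_assert(!is_stateless<iso2022_like>::value,
              "a state_t with members marks a shift encoding");
```

The appeal of this approach is that it needs no member-function detection at all: the encoding author opts in to forward-only traversal simply by putting data into `state_t`.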
> There are various factors that influence the adapted iterator traversal
> tag. For example, I wanted to say that the character iterator has the
> same traversal tag as the unit iterator, except that it's not random
> access; i.e. min(unit_iter_t,bidirectional). Is there any existing
> code anywhere for doing operations like this on iterator traversal
> category tags?
>
Not that I know of. I had something like this around for old-style
categories, but when I tried to adapt it to the new ones, I realized
that it didn't actually work. (I ended up never using it.)
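
For the old-style standard categories, one way to express min(unit_iter_t,bidirectional) is to lean on the fact that the tags form an inheritance chain, so a stronger tag converts to a weaker one. The sketch below is an illustration only: `min_category` and `char_category` are hypothetical names, and it uses the `std` tags rather than the Boost traversal tags. Those are also related by inheritance, so the same idea may carry over, but that is untested here.

```cpp
#include <forward_list>
#include <iterator>
#include <list>
#include <type_traits>

// Pick the weaker of two iterator category tags.
// If A converts to B, A is at least as strong as B, so B is the minimum.
// Caveat: for unrelated tags (input vs. output) this just returns A.
template <typename A, typename B>
struct min_category {
    typedef typename std::conditional<
        std::is_convertible<A, B>::value, B, A>::type type;
};

// Category of a character iterator layered over a unit iterator:
// the unit iterator's category, capped at bidirectional.
template <typename UnitIter>
struct char_category {
    typedef typename min_category<
        typename std::iterator_traits<UnitIter>::iterator_category,
        std::bidirectional_iterator_tag>::type type;
};

// Random access units are capped at bidirectional.
static_assert(std::is_same<char_category<const char*>::type,
                           std::bidirectional_iterator_tag>::value,
              "random access is capped at bidirectional");
// Bidirectional units stay bidirectional.
static_assert(std::is_same<char_category<std::list<char>::iterator>::type,
                           std::bidirectional_iterator_tag>::value,
              "bidirectional is unchanged");
// Forward units stay forward.
static_assert(std::is_same<char_category<std::forward_list<char>::iterator>::type,
                           std::forward_iterator_tag>::value,
              "forward is unchanged");
```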
Sebastian Redl