From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2008-03-02 17:29:07
Phil Endecott wrote:
> The common case is that you have a BOM at the start, and if there are
> any other BOMs they'll be the same. But what I don't know is what the
> Unicode specs allow in this respect, and whether it's sensible to
> provide explicit support for that limited case as well as the more
> general case. (Do the IANA character sets names that I'm using as the
> basis for the charset_t enum have any way of distinguishing these
> cases, for example? I think the answer is no.)
>
IANA registers UTF-16BE, UTF-16LE and UTF-16. BE and LE are the
fixed-endian variants. UTF-16 depends on context: if the base unit is a
16-bit entity, UTF-16 is simply endian-agnostic. If it's an 8-bit
entity, I believe UTF-16 requires a BOM.
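For the 8-bit case, BOM detection could look roughly like the sketch below. The names utf16_order and detect_utf16_bom are made up for illustration and are not part of the code under discussion. It only inspects the first two bytes; what to do when no BOM is found is left to the caller.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
enum class utf16_order { big_endian, little_endian, unknown };

// Inspect the first two bytes of a byte-oriented UTF-16 stream.
// U+FEFF serialised big-endian is FE FF; little-endian is FF FE.
inline utf16_order detect_utf16_bom(const unsigned char* data, std::size_t size)
{
    if (size < 2)
        return utf16_order::unknown;
    if (data[0] == 0xFE && data[1] == 0xFF)
        return utf16_order::big_endian;
    if (data[0] == 0xFF && data[1] == 0xFE)
        return utf16_order::little_endian;
    return utf16_order::unknown; // no BOM: the caller has to pick a policy
}
```

Note that FF FE followed by 00 00 is also how a little-endian UTF-32 BOM starts, so a real detector would have to know which family it is looking at.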
I don't think flipping endianness in the middle of a string is useful. I
can't imagine what twisted tool would generate such code.
Come to think of it, if I'm not careful, *my* code will generate it,
namely when you concatenate a BE and a LE string. Concatenating shift
encodings is *not* fun. Neither is substringing them.
>> Trying to implement this, I've found that it is apparently logically
>> impossible to provide bidirectional iterators for shift encodings, like
>> ISO 2022-based encodings. These encodings rely on state that can only be
>> known by sequentially scanning the string from front to back.
>>
>
> Yes. You may be able to argue in some cases that you can predict the
> state during backward traversal IF there are no redundant shifts and if
> there are only two states. Again, I don't know whether that's useful
> in practice (and I suspect not).
>
Not really. The only shift encodings that ever found use are those of
the ISO 2022 family, which have two different shift state sets, one
with four and one with three states, for a total of 12 shift states, not
to mention the character set selection capabilities.
Have I mentioned that the complexity of this stuff is absurd?
> We could detect the case when skip_forward_char is not implemented.
>
What I'm currently doing is detecting if state_t is an empty class.
Much, much easier than detecting if a function is implemented or not,
especially if you have a base class that provides a default for the
function.
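A minimal sketch of the empty-class check, assuming each encoding exposes a nested state_t. The two traits classes and the name is_stateless are hypothetical, and std::is_empty stands in for whichever type trait is actually used (boost::is_empty would serve the same purpose).

```cpp
#include <type_traits>

// Hypothetical encoding traits, for illustration only.
struct utf8_traits {
    struct state_t {};        // stateless: nothing carried between characters
};

struct iso2022_traits {
    struct state_t {          // shift encoding: must remember what is active
        unsigned char designated_set;
        unsigned char shift;
    };
};

// An encoding is treated as stateless exactly when its state_t is empty.
template <typename Encoding>
struct is_stateless : std::is_empty<typename Encoding::state_t> {};

static_assert(is_stateless<utf8_traits>::value,
              "an empty state_t marks a stateless encoding");
static_assert(!is_stateless<iso2022_traits>::value,
              "a state_t with members marks a shift encoding");
```

The result can then be used to pick the iterator category: bidirectional for stateless encodings, forward-only otherwise.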
> There are various factors that influence the adapted iterator traversal
> tag. For example, I wanted to say that the character iterator has the
> same traversal tag as the unit iterator, except that it's not random
> access; i.e. min(unit_iter_t,bidirectional). Is there any existing
> code anywhere for doing operations like this on iterator traversal
> category tags?
>
Not that I know of. I had something like this around for old style
categories, but when I tried to adapt it to the new ones, I realized
that it didn't actually work. (I ended up never using it.)
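For reference, the min(unit_iter_t, bidirectional) idea can be sketched for the old-style standard category tags as below. The names min_category and char_iterator_category are made up here. The sketch assumes both tags lie on the input/forward/bidirectional/random-access chain (output_iterator_tag is not handled), and it says nothing about the new-style traversal tags.

```cpp
#include <iterator>
#include <type_traits>

// Weaker of two standard iterator category tags. Hypothetical helper;
// assumes both tags are on the input -> forward -> bidirectional ->
// random_access inheritance chain.
template <typename A, typename B>
struct min_category {
    // If A derives from (or is) B, A is at least as strong, so B is the minimum.
    typedef typename std::conditional<
        std::is_base_of<B, A>::value, B, A>::type type;
};

// Category of a character iterator built on top of a unit iterator:
// same as the unit iterator's, but capped at bidirectional.
template <typename UnitIter>
struct char_iterator_category {
    typedef typename min_category<
        typename std::iterator_traits<UnitIter>::iterator_category,
        std::bidirectional_iterator_tag>::type type;
};

static_assert(std::is_same<
                  char_iterator_category<const char*>::type,
                  std::bidirectional_iterator_tag>::value,
              "random access is capped at bidirectional");
```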
Sebastian Redl
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk