From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-03-02 15:53:12


Sebastian Redl wrote:
> It gets worse. I've tried to implement a very simple "kinda-shift"
> encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to
> determine endianness. This encoding uses the shift state to remember
> what endian it is in. (No dynamic switching.)

The common case is that you have a BOM at the start, and if there are
any other BOMs they'll be the same. But what I don't know is what the
Unicode specs allow in this respect, and whether it's sensible to
provide explicit support for that limited case as well as the more
general case. (Do the IANA character set names that I'm using as the
basis for the charset_t enum have any way of distinguishing these
cases, for example? I think the answer is no.)
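
To make the common case concrete, here is a minimal sketch of reading a
leading BOM once and then decoding every later unit with the remembered
byte order. The names utf16_byte_order, detect_utf16_bom and decode_unit
are hypothetical and are not part of any proposed interface; treating an
unknown order as big-endian is also just an assumption of the sketch.

```cpp
// Hypothetical names, for illustration only.
enum class utf16_byte_order { unknown, big_endian, little_endian };

// Inspect the first two bytes of a UTF-16 byte stream.
// U+FEFF serialised big-endian is FE FF; little-endian is FF FE.
constexpr utf16_byte_order detect_utf16_bom(unsigned char b0, unsigned char b1)
{
    return (b0 == 0xFE && b1 == 0xFF) ? utf16_byte_order::big_endian
         : (b0 == 0xFF && b1 == 0xFE) ? utf16_byte_order::little_endian
         : utf16_byte_order::unknown;
}

// Assemble one 16-bit code unit from two bytes using the remembered order.
// Assumption: anything other than little_endian is decoded as big-endian.
constexpr unsigned decode_unit(utf16_byte_order order,
                               unsigned char b0, unsigned char b1)
{
    return order == utf16_byte_order::little_endian
         ? (static_cast<unsigned>(b1) << 8) | b0
         : (static_cast<unsigned>(b0) << 8) | b1;
}
```

The byte order here is fixed after the first two bytes, which matches the
"no dynamic switching" case; it says nothing about what should happen if
a later BOM disagrees with the first.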

> Trying to implement this, I've found that it is apparently logically
> impossible to provide bidirectional iterators for shift encodings, like
> ISO 2022-based encodings. These encodings rely on state that can only be
> known by sequentially scanning the string from front to back.

Yes. You may be able to argue in some cases that you can predict the
state during backward traversal IF there are no redundant shifts and if
there are only two states. Again, I don't know whether that's useful
in practice (and I suspect not).

> Any
> attempt to iterate backwards would first have to mark the switch
> positions and what modes they switch from.
>
> This can be worked around for my UTF-16VE, but not for true shift
> encodings. Thus, the charset traits probably need a flag that designates
> the set as a shift encoding and makes the iterator adapter be forward-only.

We could detect the case when skip_forward_char is not implemented.
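
One way such detection might look, as a sketch only: a compile-time trait
that tests whether a charset traits class provides a static
skip_forward_char taking an iterator by reference. That signature, and
the traits classes utf8_like_traits and shift_like_traits, are
assumptions made for the example. It uses C++11 features (decltype,
std::declval).

```cpp
#include <type_traits>
#include <utility>

// C++11 stand-in for std::void_t: maps any well-formed type list to void.
template <typename...> struct make_void { typedef void type; };

// Primary template: assume the function is absent.
template <typename Traits, typename Iter, typename = void>
struct has_skip_forward_char : std::false_type {};

// Chosen only when Traits::skip_forward_char(Iter&) is a valid expression.
template <typename Traits, typename Iter>
struct has_skip_forward_char<
    Traits, Iter,
    typename make_void<
        decltype(Traits::skip_forward_char(std::declval<Iter&>()))>::type>
    : std::true_type {};

// Hypothetical traits class that provides the function.
struct utf8_like_traits {
    template <typename Iter>
    static void skip_forward_char(Iter&) {}
};

// Hypothetical traits class that omits it.
struct shift_like_traits {};
```

An iterator adapter could then pick its traversal tag from the value of
such a trait instead of requiring an explicit flag in the charset traits.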

There are various factors that influence the adapted iterator traversal
tag. For example, I wanted to say that the character iterator has the
same traversal tag as the unit iterator, except that it's not random
access; i.e. min(unit_iter_t,bidirectional). Is there any existing
code anywhere for doing operations like this on iterator traversal
category tags?
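
For illustration, a minimal sketch of such a "min" operation, written
against the standard iterator category tags rather than the Boost
traversal tags. It relies on the tags forming an inheritance chain, so
that a more capable tag converts to a less capable one. The names
min_category and char_iterator_category are hypothetical.

```cpp
#include <iterator>
#include <type_traits>

// The lesser of two category tags. If A converts to B, A is at least as
// capable as B, so B is the minimum; otherwise A is.
// Assumption: one tag is always convertible to the other (this does not
// hold for input versus output iterator tags).
template <typename A, typename B>
struct min_category {
    typedef typename std::conditional<
        std::is_convertible<A, B>::value, B, A>::type type;
};

// Category for a character iterator built on top of UnitIter:
// the unit iterator's category, capped at bidirectional.
template <typename UnitIter>
struct char_iterator_category {
    typedef typename min_category<
        typename std::iterator_traits<UnitIter>::iterator_category,
        std::bidirectional_iterator_tag>::type type;
};
```

With this, a random access unit iterator such as const char* yields a
bidirectional character iterator, while a forward-only unit iterator
stays forward-only.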

> On a side note, Shift-JIS, EUC-JP and ISO-2022-JP are all absurdly
> complex. UTF-8 is so much easier!

Agreed :-(

Phil.

