From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-09-26 19:08:10


Joseph Gauterin wrote:
> Making the iterator a byte iterator, not a code point iterator, pushes
> the responsibility for knowing how to handle the variable widthness of
> the different encodings back onto the user.

Indeed, and smart users might sometimes prefer to take that
responsibility. For example, if I want to break up a lump of UTF8 text
into lines at each \n then I can just treat it as bytes and look for \n,
since every byte of a multibyte UTF8 character has its top bit set and
so \n can never occur inside one. As another example, an XML parser can
exploit the same property when looking for its various punctuation
characters. A UTF8 character iterator has the overhead of determining
each character's width, and variable-width iterator operations like
operator- are not O(1), so having the option to use a byte iterator
could be a significant performance help.
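
A minimal sketch of the byte-level line splitting described above. The function name split_lines_bytewise is made up for illustration, and the text is assumed to be held in a plain std::string:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split UTF-8 text into lines by scanning raw bytes for '\n' (0x0A).
// Every byte of a multibyte UTF-8 sequence is >= 0x80, so a 0x0A byte
// can only ever be a real line feed; no decoding is needed.
std::vector<std::string> split_lines_bytewise(const std::string& utf8)
{
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type nl = utf8.find('\n', start);
        if (nl == std::string::npos) {
            // Final line (empty if the text ends with '\n').
            lines.push_back(utf8.substr(start));
            break;
        }
        lines.push_back(utf8.substr(start, nl - start));
        start = nl + 1;
    }
    assert(!lines.empty());   // even empty input yields one empty line
    return lines;
}
```

The multibyte characters pass through untouched because the scan never looks inside them; it only compares bytes against 0x0A.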

Of course you could just use a vector<char> or similar when you want to
do this sort of thing, but that's not great if you want to
mix-and-match byte and character operations without copying the whole string.

I'm wondering about offering distinct "unit" (e.g. byte) and
"character" types in the charset_traits class, and providing separate
unit_iterator and character_iterator types and operations. Or maybe
the character_iterators are best provided by some sort of "adapter" layer?
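
One way the "adapter" idea might look, as a rough sketch only. The names utf8_traits, unit_t, character_t, utf8_character_iterator and character_count are all hypothetical, and the decoder assumes well-formed UTF8 with no validation:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical traits: a "unit" is the storage element, a "character" is a
// decoded code point. Neither name comes from an existing library.
struct utf8_traits {
    typedef char          unit_t;
    typedef unsigned long character_t;   // wide enough for any code point
};

// Hypothetical adapter that walks a sequence of UTF-8 units one character
// at a time. The underlying unit iterator stays available through base().
class utf8_character_iterator {
public:
    typedef std::string::const_iterator unit_iterator;

    explicit utf8_character_iterator(unit_iterator pos) : pos_(pos) {}

    // Decode the character that starts at the current unit.
    utf8_traits::character_t operator*() const {
        unsigned char lead = static_cast<unsigned char>(*pos_);
        std::size_t n = width(lead);
        utf8_traits::character_t c = (n == 1) ? lead : (lead & (0x7F >> n));
        for (std::size_t i = 1; i < n; ++i)
            c = (c << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3F);
        return c;
    }

    // Advance by one character, i.e. by one to four units.
    utf8_character_iterator& operator++() {
        pos_ += width(static_cast<unsigned char>(*pos_));
        return *this;
    }

    unit_iterator base() const { return pos_; }

    friend bool operator==(const utf8_character_iterator& a,
                           const utf8_character_iterator& b) { return a.pos_ == b.pos_; }
    friend bool operator!=(const utf8_character_iterator& a,
                           const utf8_character_iterator& b) { return a.pos_ != b.pos_; }

private:
    // Sequence length implied by the lead byte.
    static std::size_t width(unsigned char lead) {
        assert(lead < 0x80 || lead >= 0xC0);   // not a continuation byte
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        return 4;
    }

    unit_iterator pos_;
};

// Counting characters has to walk the whole string: O(n), unlike size().
std::size_t character_count(const std::string& s) {
    std::size_t n = 0;
    for (utf8_character_iterator it(s.begin()), end(s.end()); it != end; ++it)
        ++n;
    return n;
}
```

Because the adapter only wraps an iterator into the existing storage, byte and character operations can be mixed on the same string without copying it.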

> IIRC, iconv is licensed under the GPL

The iconv API is a POSIX and SUS standard. There is an implementation
in glibc, which is LGPLed; I believe that other OSes have their own
implementations (including BSD-licensed ones). I thought that it was
included in Windows since NT but Google tells me I'm wrong.

We would certainly want a conversion interface that could be adapted to
std::codecvt, iconv, recode (which is a GNU-only thing), icu, etc. I
have already written functor wrappers for iconv and recode which work
like this:

Iconver latin1_to_utf8("latin1","utf8");
utf8string s = latin1_to_utf8(x);

The functor can store any state for variable-width charsets. Iconv
takes charset names as char*s; I have put a char* name in my
charset_traits class to support this. Something is needed to indicate
policy for conversion problems, e.g. throw or insert '?' when there is
no corresponding character in the target charset. How compatible could
this be made with codecvt and icu?
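
To make the policy question concrete, here is a small sketch of a conversion functor that takes the policy as a constructor argument. It does not use iconv, codecvt or icu; it is a hand-rolled UTF8 to Latin-1 converter standing in for a real backend, and the names conversion_policy and utf8_to_latin1 are invented for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical policy for characters with no equivalent in the target charset.
enum conversion_policy { throw_on_unmappable, substitute_question_mark };

// Hypothetical functor converting UTF-8 to Latin-1 (ISO-8859-1), which
// covers exactly the code points U+0000..U+00FF. The input is assumed to
// be well-formed UTF-8; only truncation at the end is checked.
class utf8_to_latin1 {
public:
    explicit utf8_to_latin1(conversion_policy p = throw_on_unmappable)
        : policy_(p) {}

    std::string operator()(const std::string& in) const {
        std::string out;
        std::size_t i = 0;
        while (i < in.size()) {
            unsigned char lead = static_cast<unsigned char>(in[i]);
            // Sequence length implied by the lead byte.
            std::size_t n = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            if (i + n > in.size())
                throw std::invalid_argument("truncated UTF-8 sequence");

            // Decode one code point.
            unsigned long c = (n == 1) ? lead : (lead & (0x7F >> n));
            for (std::size_t k = 1; k < n; ++k)
                c = (c << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
            i += n;

            if (c <= 0xFF)
                out += static_cast<char>(static_cast<unsigned char>(c));
            else if (policy_ == substitute_question_mark)
                out += '?';                       // lossy, but never fails
            else
                throw std::range_error("character not representable in Latin-1");
        }
        return out;
    }

private:
    conversion_policy policy_;
};
```

With substitute_question_mark, a euro sign in the input comes out as a single '?'; with the default policy the same input throws std::range_error. A wrapper around a real backend would need to map that backend's own error reporting onto the same two behaviours.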

Thanks for the many replies. Do keep posting. I'm not going to try to
keep up with replies to everything, though; I'm going to try and write
some code!

Regards,

Phil.

