Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-02-12 18:32:14
Cory Nelson wrote:
> Is there interest in having a Unicode codec library submitted to Boost?
> Here is what I have (only tested with GCC 4.3 on Debian, and VC++ 2008):
> Right now it is pretty simple to use:
> transcode<utf8, utf16le>(forwarditerator, forwarditeratorend,
> outputiterator, traits [, maximum]);
> transcode<wchar_encoding, utf32be>(inputrange, outputrange, traits [, maximum]);
> There is also a codecvt facet which supports any-to-any.
> Supports UTF-8, UTF-16, and UTF-32, in little or big endian. Has a
> special wchar_encoding that maps to UTF-16 or UTF-32 depending on your
> platform. A traits class controls error handling.
Yes, Boost definitely ought to have Unicode conversion. Yours is not
the first proposal, and there is actually already one implementation
hidden inside another Boost library. I wrote some UTF conversion code
a while ago and was trying to build a more comprehensive character set
library around it but realised after a while that that was too mammoth
a job. So I think that getting _just_ the UTF conversion into Boost,
ASAP, is the right thing to do.
I've had a look at your code. I like that you have implemented what I
called an "error policy". It is wasteful to continuously check the
validity of input that comes from trusted code, but important to check
it when it comes from an untrusted source. However I'm not sure that
your actual conversion is as efficient as it could be. I spent quite a
while profiling my UTF8 conversion and came up with this:
I think that you could largely copy&paste bits of that into the right
places in your algorithm and get a significant speedup.
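
To make the "error policy" idea concrete, here is a minimal sketch of a UTF-8 decoder whose validity checks are selected by a template parameter. It is not taken from either implementation discussed in this thread; the names unchecked_policy, throwing_policy, replacing_policy and decode_one are hypothetical, and it assumes a C++11 compiler for char32_t. With the unchecked policy the tests are compile-time false and should be optimised away.

```cpp
#include <stdexcept>

// Hypothetical policies (illustrative names only).
struct unchecked_policy {            // input comes from trusted code
    static constexpr bool check = false;
    static char32_t on_error() { return 0; }   // never reached
};
struct throwing_policy {             // untrusted input: fail loudly
    static constexpr bool check = true;
    static char32_t on_error() { throw std::runtime_error("invalid UTF-8"); }
};
struct replacing_policy {            // untrusted input: substitute U+FFFD
    static constexpr bool check = true;
    static char32_t on_error() { return 0xFFFD; }
};

// Decode one code point starting at p (requires p != end) and advance p.
template <typename Policy>
char32_t decode_one(const char*& p, const char* end) {
    unsigned char b0 = static_cast<unsigned char>(*p++);
    if (b0 < 0x80) return b0;        // ASCII fast path

    int extra;                       // number of continuation bytes
    char32_t cp;
    if (b0 < 0xE0)      { extra = 1; cp = b0 & 0x1F; }
    else if (b0 < 0xF0) { extra = 2; cp = b0 & 0x0F; }
    else                { extra = 3; cp = b0 & 0x07; }

    // Reject stray continuation bytes, overlong 2-byte leads,
    // out-of-range leads and truncated sequences.
    if (Policy::check && (b0 < 0xC2 || b0 > 0xF4 || end - p < extra))
        return Policy::on_error();

    for (int i = 0; i < extra; ++i) {
        unsigned char b = static_cast<unsigned char>(*p);
        if (Policy::check && (b & 0xC0) != 0x80)
            return Policy::on_error();   // bad byte is left unconsumed
        ++p;
        cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong encodings, surrogates and values above U+10FFFF.
    static const char32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    if (Policy::check && (cp < min_cp[extra] || cp > 0x10FFFF ||
                          (cp >= 0xD800 && cp <= 0xDFFF)))
        return Policy::on_error();
    return cp;
}
```

A caller would pick the policy per call site, for example decode_one<unchecked_policy>(p, end) for strings produced by its own code and decode_one<throwing_policy>(p, end) for data read from the network.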
Having said all that, I must say that I actually use the code that I
wrote quite rarely. I now tend to use UTF8 everywhere and treat it as
a sequence of bytes. Because of the properties of UTF8 I find it's
rare to need to identify individual code points. For example, if I'm
scanning for a matching " or ) I can just look for the next matching
byte, without worrying about where the character boundaries are. If I
were to use a special UTF8-decoding iterator to do that scan I would
waste a lot of time doing unnecessary conversions. I'm not sure what
conclusion to draw from that: perhaps just that any "UTF8 string", or
whatever, should come with a health warning that users should first
learn how UTF8 works and review whether or not they actually need it.
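
The byte-scanning point can be sketched as follows. In UTF-8 every byte of a multi-byte sequence has its top bit set, so an ASCII byte such as " or ) can never occur inside another character, and a plain byte search is enough. The helper name find_closing_quote is hypothetical, and the sketch assumes the input is well-formed UTF-8 held in a std::string.

```cpp
#include <cstddef>
#include <string>

// Return the byte offset of the next '"' after the opening quote at
// offset `open`, or std::string::npos if there is none.
// No decoding is needed: bytes >= 0x80 are simply skipped over by the
// search because they can never compare equal to an ASCII delimiter.
std::size_t find_closing_quote(const std::string& utf8, std::size_t open) {
    return utf8.find('"', open + 1);
}
```

The returned offset is a byte offset, not a code point index, which is usually what is wanted for slicing the string afterwards.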