From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2020-06-15 19:05:24


Dear All,

I have been looking at the UTF-8 decoding code in the proposed
Boost.Text, as this is a problem I've looked at myself in the past.
I've mentioned an issue with the copyright in another message.
Here are my other observations.

1. The SIMD code is x86-specific. It doesn't need to be; I think
it could use gcc's vector builtins to do the same thing and be
portable to other SIMD implementations. (Clang provides the same
builtins; I'm not sure about what you need to do on MSVC/Windows.)
See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
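As a rough illustration of point 1, here is a minimal sketch of an ASCII check written with the gcc/Clang vector extension rather than x86 intrinsics. It is not taken from Boost.Text; the names u8x16, all_ascii_16 and ascii_prefix_length are made up for this example, and the final reduction goes through two 64-bit words because the extension has no portable horizontal-OR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// 16 lanes of uint8_t, declared with the gcc/Clang vector extension
// (hypothetical type name, not part of any library).
typedef std::uint8_t u8x16 __attribute__((vector_size(16)));

// True if none of the 16 bytes at p has its top bit set.
inline bool all_ascii_16(const unsigned char* p)
{
    const u8x16 mask = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
                        0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80};
    u8x16 v;
    std::memcpy(&v, p, sizeof v);      // unaligned load
    const u8x16 high = v & mask;       // element-wise AND, target-independent

    // Reduce the 16 lanes by OR-ing the two 64-bit halves.
    std::uint64_t halves[2];
    std::memcpy(halves, &high, sizeof halves);
    return (halves[0] | halves[1]) == 0;
}

// Number of leading bytes < 0x80 in [s, s + n): whole 16-byte blocks
// first, then a scalar loop to find the exact position.
inline std::size_t ascii_prefix_length(const unsigned char* s, std::size_t n)
{
    std::size_t i = 0;
    while (n - i >= 16 && all_ascii_16(s + i))
        i += 16;
    while (i < n && s[i] < 0x80)
        ++i;
    return i;
}
```

The compiler is left to pick SSE2, NEON or plain scalar code for the AND; whether that matches hand-written intrinsics would need measuring.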

2. The SIMD code only seems to provide a fast path for bytes < 0x80,
falling back to sequential code for everything else. I guess I was
expecting something more sophisticated.

3. The code used for bytes >= 0x80, and in all cases for non-x86,
is here:
https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
around lines 400-560. It implements a state machine, which surprises
me; it takes much less code and gives better performance if you write
out the bit-testing and shifting etc. explicitly. The state-machine
code seems to be about 50% slower than my existing UTF-8 decoding code.
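To show what I mean by writing out the bit-testing and shifting explicitly, here is a minimal sketch. It is neither the Boost.Text code nor my own decoder; decode_one and replacement are names invented for this example. It assumes p < end on entry and, as a simplification, consumes only the lead byte when a sequence is malformed.

```cpp
#include <cstddef>
#include <cstdint>

// U+FFFD REPLACEMENT CHARACTER, returned for any malformed input.
constexpr char32_t replacement = 0xFFFD;

// Decodes one code point from [p, end) and advances p past it.
// Precondition: p < end.
inline char32_t decode_one(const unsigned char*& p, const unsigned char* end)
{
    // A continuation byte has the form 10xxxxxx.
    auto cont = [](unsigned char b) { return (b & 0xC0) == 0x80; };

    const unsigned char b0 = *p++;
    if (b0 < 0x80)                                   // 0xxxxxxx
        return b0;

    if (b0 >= 0xC2 && b0 <= 0xDF) {                  // 110xxxxx, no overlongs
        if (p < end && cont(p[0])) {
            const char32_t c = ((b0 & 0x1F) << 6) | (p[0] & 0x3F);
            p += 1;
            return c;
        }
        return replacement;
    }

    if ((b0 & 0xF0) == 0xE0) {                       // 1110xxxx
        if (end - p >= 2 && cont(p[0]) && cont(p[1])) {
            const char32_t c = ((b0 & 0x0F) << 12) | ((p[0] & 0x3F) << 6)
                             | (p[1] & 0x3F);
            // Reject overlong forms and UTF-16 surrogates.
            if (c >= 0x800 && (c < 0xD800 || c > 0xDFFF)) {
                p += 2;
                return c;
            }
        }
        return replacement;
    }

    if (b0 >= 0xF0 && b0 <= 0xF4) {                  // 11110xxx
        if (end - p >= 3 && cont(p[0]) && cont(p[1]) && cont(p[2])) {
            const char32_t c = ((b0 & 0x07) << 18) | ((p[0] & 0x3F) << 12)
                             | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
            // Reject overlong forms and values above U+10FFFF.
            if (c >= 0x10000 && c <= 0x10FFFF) {
                p += 3;
                return c;
            }
        }
        return replacement;
    }

    // Stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF.
    return replacement;
}
```

For example, the bytes E2 82 AC decode to U+20AC and leave p three bytes further on, while the overlong pair C0 80 yields U+FFFD and advances p by one.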

4. There aren't enough comments anywhere in the code I've looked at!

Regards, Phil.