From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2020-06-16 16:09:34


On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
<boost_at_[hidden]> wrote:
>
> Dear All,
>
> I have been looking at the UTF-8 decoding code in the proposed
> Boost.Text, as this is a problem I've looked at myself in the past.
> I've mentioned an issue with the copyright in another message.
> Here are my other observations.
>
> 1. The SIMD code is x86-specific. It doesn't need to be; I think
> it could use gcc's vector builtins to do the same thing and be
> portable to other SIMD implementations. (Clang provides the same
> builtins; I'm not sure about what you need to do on MSVC/Windows.)
> See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

That page describes vector-friendly data types and arithmetic
operations. It does not seem to support the operations actually used
in the code currently in Boost.Text.

> 2. The SIMD code only seems to provide a fast path for bytes < 0x80,
> falling back to sequential code for everything else. I guess I was
> expecting something more sophisticated.

The code makes the fast path extra fast, but the slow path, being
quite branchy, is not really amenable to vectorization. If you have
an implementation that proves that claim false, I'm happy to use it.
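
To make the fast-path idea concrete, here is a minimal sketch. It is not
the Boost.Text code: instead of x86 SIMD intrinsics it assumes a portable
substitute that tests 8 bytes at a time for a set high bit, then hands the
rest to ordinary sequential decoding. The name ascii_prefix_length is
hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Boost.Text): returns the length of the
// leading run of ASCII bytes (< 0x80) in [p, p + n).
std::size_t ascii_prefix_length(const unsigned char* p, std::size_t n)
{
    // One 0x80 per byte lane: any set bit means a non-ASCII byte is present.
    const std::uint64_t high_bits = 0x8080808080808080ULL;

    std::size_t i = 0;

    // Fast path: examine 8 bytes per iteration.
    while (n - i >= 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, p + i, sizeof(chunk)); // avoids unaligned access
        if (chunk & high_bits)
            break; // a byte >= 0x80 is somewhere in this chunk
        i += 8;
    }

    // Tail, or the chunk that contained a non-ASCII byte: go byte by byte.
    while (i < n && p[i] < 0x80)
        ++i;

    return i;
}
```

A caller would copy the first ascii_prefix_length(p, n) bytes straight to
the output as code points, then decode from that offset with the
sequential path.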

> 3. The code used for bytes >= 0x80, and in all cases for non-x86,
> is here:
> https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
> around lines 400-560. It implements a state machine, which surprises
> me; it takes much less code and gives better performance if you write
> out the bit-testing and shifting etc. explicitly. This seems to be
> about 50% slower than my existing UTF-8 decoding code.

Could you point me to that code, and let me use your benchmarks to
verify? I'm happy to do something faster!
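
For anyone following the thread, this is roughly what "writing out the
bit-testing and shifting explicitly" tends to look like. It is only a
sketch under my own assumptions: it is neither Phil's code (which is not
shown here) nor the state machine in transcode_iterator.hpp, and the names
decode_result and decode_one are hypothetical. On any malformed input it
returns U+FFFD and consumes a single byte, which is simpler than what a
production decoder would likely do.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: the decoded code point and bytes consumed.
struct decode_result {
    std::uint32_t code_point;
    std::size_t length;
};

constexpr std::uint32_t replacement = 0xFFFD;

inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Decodes one code point from [p, p + n) using explicit bit tests.
// Rejects overlong forms, surrogates, and values above U+10FFFF.
decode_result decode_one(const unsigned char* p, std::size_t n)
{
    if (n == 0)
        return {replacement, 0};

    const unsigned char b0 = p[0];
    if (b0 < 0x80) // 0xxxxxxx: ASCII
        return {b0, 1};

    if ((b0 & 0xE0) == 0xC0) { // 110xxxxx 10xxxxxx
        if (n >= 2 && is_continuation(p[1])) {
            const std::uint32_t cp = ((b0 & 0x1Fu) << 6) | (p[1] & 0x3Fu);
            if (cp >= 0x80) // otherwise overlong
                return {cp, 2};
        }
    } else if ((b0 & 0xF0) == 0xE0) { // 1110xxxx + 2 continuation bytes
        if (n >= 3 && is_continuation(p[1]) && is_continuation(p[2])) {
            const std::uint32_t cp = ((b0 & 0x0Fu) << 12) |
                                     ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
            if (cp >= 0x800 && (cp < 0xD800 || cp > 0xDFFF))
                return {cp, 3};
        }
    } else if ((b0 & 0xF8) == 0xF0) { // 11110xxx + 3 continuation bytes
        if (n >= 4 && is_continuation(p[1]) && is_continuation(p[2]) &&
            is_continuation(p[3])) {
            const std::uint32_t cp = ((b0 & 0x07u) << 18) |
                                     ((p[1] & 0x3Fu) << 12) |
                                     ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
            if (cp >= 0x10000 && cp <= 0x10FFFF)
                return {cp, 4};
        }
    }

    // Stray continuation byte, invalid lead byte, truncated or invalid sequence.
    return {replacement, 1};
}
```

Whether something along these lines actually beats the state machine is
exactly what the benchmarks would need to show.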

> 4. There aren't enough comments anywhere in the code I've looked at!

I only put comments where something unclear or unexpected is
happening. The intention is that the rest of the code is clear enough
to read on its own. Particularly in the case of Boost.Text, where
most of the code follows one or more Unicode specifications, I tend to
put a comment indicating where the online description of an algorithm
might be found, and that's it -- except for API docs, of course.

Zach

