From: Alexander Grund (alexander.grund_at_[hidden])
Date: 2020-06-18 06:45:52
> I think it has most of what's needed, though it seems that the
> type conversion __builtin_convertvector, which is needed to
> expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present
> in newer versions of g++ than I have.
Then it's likely not very useful for now. Maybe later, once that compiler
version is more widespread.
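
For reference, a minimal sketch of the widening step being discussed, assuming GCC/Clang vector extensions. The helper name `widen4` and the macro `SKETCH_HAVE_CONVERTVECTOR` are made up for this illustration, and a plain loop is used where the builtin cannot be detected:

```cpp
#include <cstdint>
#include <cstring>

// Detect the builtin where the compiler lets us ask; otherwise use the loop.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_convertvector)
#    define SKETCH_HAVE_CONVERTVECTOR 1
#  endif
#endif

// Hypothetical helper: zero-extend 4 bytes to 4 UTF-32 code units.
// Only meaningful as a decode step when all 4 bytes are ASCII.
inline void widen4(const unsigned char* in, char32_t* out) {
#ifdef SKETCH_HAVE_CONVERTVECTOR
    typedef std::uint8_t  u8x4  __attribute__((vector_size(4)));
    typedef std::uint32_t u32x4 __attribute__((vector_size(16)));
    u8x4 bytes;
    std::memcpy(&bytes, in, sizeof bytes);               // unaligned-safe load
    u32x4 wide = __builtin_convertvector(bytes, u32x4);  // element-wise widening
    std::memcpy(out, &wide, sizeof wide);                // unaligned-safe store
#else
    for (int k = 0; k < 4; ++k) out[k] = in[k];          // portable fallback
#endif
}
```

Both paths give the same result, so code written against the builtin today still has somewhere to go on older compilers.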
> // Attempt to decode the subset of UTF-8 with code points < 256.
> // Format is either 0xxxxxxx -> 0xxxxxxx
> // or 110---xx 10yyyyyy -> xxyyyyyy
> // The input mustn't start or finish in the middle of a multi-byte
> // character.
> // Other inputs produce undefined outputs.
Good code for that special case, but I don't think "undefined outputs" is
acceptable. The other SIMD UTF-8 conversions I've seen basically all
focus on converting as much ASCII as possible and fall back to one-by-one
decoding once a non-ASCII byte is found.
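
A rough sketch of that "ASCII fast path, scalar fallback" shape, to make the idea concrete. It is not taken from any particular library: the names `decode_one` and `utf8_to_utf32` are hypothetical, an 8-byte word test stands in for real SIMD, and returning U+FFFD on malformed input is just one possible error policy:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: decode one code point starting at s[i] and advance i.
// Returns U+FFFD for malformed or truncated input; never reads past s.size().
inline char32_t decode_one(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;
    int extra;
    char32_t cp;
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return 0xFFFD;  // input ends mid-character
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // leave the bad byte unconsumed
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    // Reject overlong forms, surrogates and values above U+10FFFF.
    static const char32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0xFFFD;
    return cp;
}

inline std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    out.reserve(s.size());
    std::size_t i = 0;
    while (i < s.size()) {
        if (s.size() - i >= 8) {
            std::uint64_t w;
            std::memcpy(&w, s.data() + i, 8);
            if ((w & 0x8080808080808080ULL) == 0) {
                // Fast path: all 8 bytes are ASCII, so widening is enough.
                for (int k = 0; k < 8; ++k)
                    out.push_back(static_cast<unsigned char>(s[i + k]));
                i += 8;
                continue;
            }
        }
        // Slow path: one code point, fully checked, then retry the fast path.
        out.push_back(decode_one(s, i));
    }
    return out;
}
```

The point is that every input has a defined result: the fast path only runs on blocks proven to be ASCII, and everything else goes through the checked decoder.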
> That will be quick, but it does lack a few things; it doesn't check if
> it has reached the end of the input and it doesn't do any error checking.
So not really usable either. BUT: compare this to Boost.Locale, which has
`decode` and `decode_valid` functions, where the latter assumes valid UTF-8.
Checking for end-of-input is obviously a must, however.
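
To illustrate the distinction, here is a sketch of a checked and a "trust the caller" decoder for the code-points-below-256 subset from the quoted code. These are stand-alone functions written for this mail, not Boost.Locale's actual signatures; `decode_checked` and the sentinel names `illegal` and `incomplete` are choices made for the sketch:

```cpp
#include <cassert>

// Sentinel results chosen for this sketch; neither is a valid code point.
constexpr char32_t illegal    = 0xFFFFFFFFu;
constexpr char32_t incomplete = 0xFFFFFFFEu;

// Checked variant: validates the bytes and never reads at or past `last`.
// On error, `p` is left unchanged so the caller can decide how to recover.
inline char32_t decode_checked(const unsigned char*& p, const unsigned char* last) {
    assert(p != last);
    const unsigned char b0 = *p;
    if (b0 < 0x80) { ++p; return b0; }
    // Only lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF;
    // 0xC0 and 0xC1 would be overlong encodings.
    if (b0 != 0xC2 && b0 != 0xC3) return illegal;
    if (last - p < 2) return incomplete;  // input ends mid-character
    const unsigned char b1 = p[1];
    if ((b1 & 0xC0) != 0x80) return illegal;
    p += 2;
    return static_cast<char32_t>(((b0 & 0x03) << 6) | (b1 & 0x3F));
}

// Unchecked variant: the caller guarantees valid, complete input in this
// subset, so there are no validity or end-of-input tests at all.
inline char32_t decode_valid(const unsigned char*& p) {
    const unsigned char b0 = *p++;
    if (b0 < 0x80) return b0;
    const unsigned char b1 = *p++;
    return static_cast<char32_t>(((b0 & 0x03) << 6) | (b1 & 0x3F));
}
```

The unchecked version is only safe when something upstream has already validated the text, which is why it makes sense as a separate, explicitly named entry point rather than the default.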
BTW: Does Boost.Text have functions or overloads where you can specify
that text is in a specific encoding/normalization?
If not, I think this should be added. Sometimes you get text from an
internal function and know those things so you can skip verification and