Subject: Re: [boost] [beast] Formal review
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2017-07-11 00:34:04


Andrey Semashev wrote:
> On 07/10/17 19:57, Phil Endecott via Boost wrote:
>> Vinnie Falco wrote:
>>> The utf-8 validation of text websocket message payloads is a critical
>>> bottleneck.
>>
>> I thought I'd have a look at that. Your code has a fast-path that
>> attempts to check 8 bytes at a time by masking with 0x8080808080808080.
>> I agree that that is probably a useful optimisation which a compiler is
>> unlikely to apply by itself.
>
> I wouldn't be so quick to agree. Modern compilers tend to be able to
> vectorize simple loops quite efficiently.

Not this one (I've tried it).

>> Your implementation
>> (
>> https://github.com/vinniefalco/Beast/blob/6f88f0182d461e9cabe55032f966c9ca4f77bf46/include/beast/websocket/detail/utf8_checker.hpp
>> ) does this:
>>
>> if((*reinterpret_cast<std::size_t const*>(in) & mask) != 0)
>>
>> So you're reinterpret_casting a char* to a size_t* and then dereferencing
>> it. Isn't that undefined behaviour?
>
> C++ allows chars and unsigned chars (which std::uint8_t presumably is)
> to alias other types

My understanding is that you can reinterpret_cast FROM other types TO
chars, but not FROM chars TO other types. But there are surely people
here who understand this stuff better than me.
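To illustrate the asymmetry as I understand it (just a sketch; the names are made up, and it assumes buf points at at least sizeof(std::size_t) readable bytes):

#include <cstddef>
#include <cstring>

void aliasing_example(const char* buf)
{
  std::size_t x = 42;

  // Allowed: examining any object's bytes through a (possibly unsigned) char pointer.
  const char* bytes = reinterpret_cast<const char*>(&x);
  (void)bytes[0];

  // Not allowed, as I understand it: there is no std::size_t object living at buf,
  // so dereferencing the cast pointer is undefined behaviour.
  // const std::size_t* q = reinterpret_cast<const std::size_t*>(buf);
  // std::size_t y = *q;

  // The portable alternative: copy the bytes into a real std::size_t.
  std::size_t temp;
  std::memcpy(&temp, buf, sizeof(temp));
  (void)temp;
}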

> BTW, the fast loop ends on the first byte >=0x80 and then the validation
> continues in the slow loop until the end of input. It might be better to
> resume the fast loop if validation of the "vector" passes. Otherwise,
> for example, an early comment in the input could ruin the performance.

Yes, I also noticed that; quite a common case in en_GB would be to have
mostly-ASCII with occasional currency symbols.

Vinnie Falco wrote:
>> So you're reinterpret_casting a char* to a size_t* and then dereferencing
>> it. Isn't that undefined behaviour?
>
> Which platform do you think this doesn't work on?

Wrong question. It's likely that it works on current compilers on current
platforms. But if I'm right that it is undefined behaviour then it might
stop working with some future compiler update somewhere.

> The reinterpret_cast<> can be trivially changed to std::memcpy:
>
> std::size_t temp;
> std::memcpy(&temp, in, sizeof(temp));
> if((temp & mask) != 0)

Yes, I believe that's the right thing to do.
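Spelled out (a sketch only; the function name is mine, and it assumes a 64-bit std::size_t and that in points at at least sizeof(std::size_t) readable bytes), the fast-path test then becomes:

#include <cstddef>
#include <cstring>

// Returns true if all sizeof(std::size_t) bytes starting at 'in' have the top bit
// clear, i.e. they are all ASCII.
inline bool ascii_word(const char* in)
{
  std::size_t temp;
  std::memcpy(&temp, in, sizeof(temp)); // a decent optimiser turns this into a single load
  return (temp & 0x8080808080808080) == 0; // assumes 64-bit std::size_t
}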

Regarding the relative performance of different UTF-8 checking code:

I've done a quick test by hacking some of my own UTF-8 decoding code to
just validate, and I see a factor of about 4 improvement going from the plain
byte-at-a-time version to an optimised version that checks the top bits of 8 bytes
at a time. The code is below. I measure a throughput of about 1.6 GB/second
for ASCII-only input, on my ARMv8 test system. I'd be interested to hear how
that compares to the other implementations; a very quick test suggests that
it might be a bit quicker than the Beast code but I'd like someone else to
confirm or deny. (It might also be wrong - I've not tested it with non-ASCII
input.)

Note that this is only about 50 lines of code; Beast's utf8_checker.hpp is
maybe 5 times as long. Code follows.

Regards, Phil.

#include <cstdint>  // std::uint8_t, std::uintptr_t
#include <cstring>  // std::memcpy

template <typename ITER>
bool is_valid_utf8(ITER i, ITER end)
{
  // Check if range is valid and complete UTF-8.
  // (Maybe incomplete input, i.e. ending in the middle of a multi-byte character, needs
  // to be reported differently?)
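  // (Note: only the lead/continuation bit patterns are checked; overlong encodings,
  // surrogates and code points above U+10FFFF are not rejected.)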

  while (i != end) {

    // If i is suitably aligned, do a fast word-at-a-time check for ASCII characters.
    // FIXME this only works if ITER is a contiguous iterator; it needs a "static if".
    const char* p = &(*i);
    const char* e = p + (end-i); // I don't think &(*end) is allowed because it appears to dereference end.
    unsigned long int w; // Should be 32 bits on 32-bit processor and 64 bits on 64-bit processor.
    if (reinterpret_cast<std::uintptr_t>(p) % sizeof(w) == 0) {
      while (p+sizeof(w) <= e) {
        std::memcpy(&w,p,sizeof(w));
        if (w & 0x8080808080808080) break; // If any of the top bits are set, fall back to the
                                            // byte-at-a-time code below.
                                            // (Is there a better way to write the mask value that would work
                                            // for e.g. 128-bit ints? Is that expression OK for 32-bit ints?)
        p += sizeof(w);
        i += sizeof(w);
      }
      if (p == e) break;
    }

    std::uint8_t b0 = *i++;
    if ((b0 & 0x80) == 0) continue; // Single byte chars are 0xxxxxxx
    if (i == end) return false; // Incomplete
    std::uint8_t b1 = *i++;
    if ((b1 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx
    if ((b0 & 0xe0) == 0xc0) continue; // Two-byte chars start 110xxxxx
    if (i == end) return false; // Incomplete
    std::uint8_t b2 = *i++;
    if ((b2 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx
    if ((b0 & 0xf0) == 0xe0) continue; // Three-byte chars start 1110xxxx
    if (i == end) return false; // Incomplete
    std::uint8_t b3 = *i++;
    if ((b3 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx
    if ((b0 & 0xf8) == 0xf0) continue; // Four-byte chars start 11110xxx
    return false; // First byte is invalid, i.e. 10xxxxxx or 11111xxx

  }
  return true;
}
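
A trivial harness along these lines (the buffer size and repeat count are arbitrary choices here, not exactly what I ran) is enough to get a throughput figure for the ASCII-only case:

#include <chrono>
#include <cstdio>
#include <string>

// (with is_valid_utf8 as above)

int main()
{
  std::string buf(1 << 20, 'a'); // 1 MB of ASCII
  const int reps = 1000;

  auto t0 = std::chrono::steady_clock::now();
  bool ok = true;
  for (int r = 0; r < reps; ++r) {
    // (a more careful benchmark would stop the compiler hoisting this call out of the loop)
    ok &= is_valid_utf8(buf.data(), buf.data() + buf.size());
  }
  auto t1 = std::chrono::steady_clock::now();

  double secs = std::chrono::duration<double>(t1 - t0).count();
  std::printf("%s, %.2f GB/s\n", ok ? "valid" : "invalid",
              buf.size() * double(reps) / 1e9 / secs);
}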

