Subject: Re: [boost] [beast] Formal review
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2017-07-11 13:40:07


Vinnie Falco wrote:
>>> The reinterpret_cast<> can be trivially changed to std::memcpy:
>>> ...
>> Yes, I believe that's the right thing to do.
>
> That hurts 32-bit ARM.

I think that's an issue with whatever compiler you're using, not the
architecture; I've just done a quick test with arm-linux-gnueabihf-g++-6
6.3.0 and I get about a 5% speedup by using memcpy.
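
For reference, a minimal sketch of the memcpy form under discussion. The names load_word and four_bytes_ascii are hypothetical and are not taken from Beast. With optimization enabled, compilers such as GCC and Clang can usually turn a fixed-size memcpy like this into a plain load where the target permits it, which would be consistent with the measurement above.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: read one 32-bit word from a byte buffer without
// reinterpret_cast, so the read is well defined at any alignment.
inline std::uint32_t load_word(const unsigned char* p)
{
  std::uint32_t w;
  std::memcpy(&w, p, sizeof(w)); // Byte-wise copy into a properly typed object.
  return w;
}

// Hypothetical helper: true if none of the four bytes starting at p has its
// top bit set, i.e. all four are ASCII.
inline bool four_bytes_ascii(const unsigned char* p)
{
  return (load_word(p) & 0x80808080u) == 0;
}
```

The mask test does not depend on byte order, since every byte of the word is checked in the same way.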

> There's just an eensy teensy problem, the Beast validator is an
> "online" algorithm. It works with chunks of the entire input sequence
> at a time, sequentially, so there could be a code point that is split
> across the buffer boundary.

Yes, I did notice that but it wasn't clear that it was actually being
used.
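
To make the boundary case concrete, here is a small sketch that reports whether a chunk stops part-way through a multi-byte character. Both helper names are hypothetical, and this is only an illustration of the situation described in the quote, not how Beast handles it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes a character needs, judged from its
// first byte alone; 0 means the byte cannot start a character.
inline std::size_t utf8_length(std::uint8_t b0)
{
  if ((b0 & 0x80) == 0x00) return 1; // 0xxxxxxx
  if ((b0 & 0xe0) == 0xc0) return 2; // 110xxxxx
  if ((b0 & 0xf0) == 0xe0) return 3; // 1110xxxx
  if ((b0 & 0xf8) == 0xf0) return 4; // 11110xxx
  return 0;                          // 10xxxxxx or 11111xxx
}

// Hypothetical helper: true if the chunk ends before its last character is
// complete, so the next chunk has to begin with continuation bytes.
inline bool ends_mid_character(const std::uint8_t* data, std::size_t size)
{
  // Look back over at most three bytes for the first byte of the last character.
  for (std::size_t back = 1; back <= 3 && back <= size; ++back) {
    std::uint8_t b = data[size - back];
    if ((b & 0xc0) == 0x80) continue; // Continuation byte, keep looking.
    return utf8_length(b) > back;     // Needs more bytes than the chunk holds.
  }
  return false;
}
```

For example, U+00E9 is encoded as the bytes C3 A9; if one buffer ends with C3 and the next begins with A9, the first chunk ends mid-character and the validator has to remember that between calls.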

> I admit that there is a surprisingly large amount of code required just
> to handle this case.

The following code is totally untested.

#include <stdint.h> // uint8_t, uintptr_t
#include <string.h> // memcpy

template <typename ITER>
bool is_valid_utf8(ITER i, ITER end, uint8_t& pending)
{
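  // Check that the range is structurally well-formed UTF-8: each lead byte is followed
  // by the right number of 10xxxxxx continuation bytes. Overlong encodings, surrogates
  // and values above U+10FFFF are not rejected.
  // pending is used to carry state about an incomplete multi-byte character
  // from one call to the next. It should be zero initially and is zero on return if
  // the input is not mid-character. After submitting the last chunk the caller
  // should check both the return value and pending==0.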

  // Skip bytes pending from last buffer.
  // The number of 1s at the most significant end of the first byte of a multi-byte
  // character indicates the total number of bytes in the character. pending is
  // this byte, shifted to allow for the number of bytes already seen.
  while (pending & 0x80) {
    if (i == end) return true; // Chunk exhausted mid-character; pending keeps the state.
    uint8_t b = *i++;
    if ((b & 0xc0) != 0x80) return false; // Must be a 10xxxxxx continuation byte.
    pending = pending<<1;
  }

  pending = 0;

  while (i != end) {

    // If i is suitably aligned, do a fast word-at-a-time check for ASCII characters.
    // FIXME this only works if ITER is a contiguous iterator; it needs a "static if".
    const char* p = &(*i);
    const char* e = p + (end-i); // I don't think &(*end) is allowed because it appears to dereference end.
    uintptr_t w; // Pointer-sized, so normally 32 bits on 32-bit targets and 64 bits on 64-bit
                 // targets (unsigned long is only 32 bits on 64-bit Windows).
    if (reinterpret_cast<uintptr_t>(p) % sizeof(w) == 0) {
      while (p+sizeof(w) <= e) {
        memcpy(&w,p,sizeof(w));
        if (w & 0x8080808080808080) break; // If any of the top bits are set, fall back to the
                                            // byte-at-a-time code below.
                                            // (Is there a better way to write the mask value that would work
                                            // for e.g. 128-bit ints? Is that expression OK for 32-bit ints?)
        p += sizeof(w);
        i += sizeof(w);
      }
      if (p == e) break;
    }

    uint8_t b0 = *i++;
    if ((b0 & 0x80) == 0) continue; // Single byte chars are 0xxxxxxx

    if ((b0 & 0xc0) == 0x80) return false; // 10xxxxxx not allowed as first byte of character
    if ((b0 & 0xf8) == 0xf8) return false; // 11111xxx is not valid
                                            // At this point, we know b0 is a valid first-byte

    if (i == end) { // Incomplete input
      pending = b0 << 1; // 1 byte seen so far, rest are pending.
      return true;
    }

    uint8_t b1 = *i++;
    if ((b1 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx
    if ((b0 & 0xe0) == 0xc0) continue; // Two-byte chars start 110xxxxx

    if (i == end) { // Incomplete input
      pending = b0 << 2; // 2 bytes seen so far, rest are pending
      return true;
    }

    uint8_t b2 = *i++;
    if ((b2 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx
    if ((b0 & 0xf0) == 0xe0) continue; // Three-byte chars start 1110xxxx

    if (i == end) { // Incomplete input
      pending = b0 << 3; // 3 bytes seen so far, rest are pending
      return true;
    }

    uint8_t b3 = *i++;
    if ((b3 & 0xc0) != 0x80) return false; // Following bytes are all 10xxxxxx
    if ((b0 & 0xf8) == 0xf0) continue; // Four-byte chars start 11110xxx

    return false; // Not reached: after the checks above, b0 can only be 11110xxx here.

  }
  return true;

}
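
On the mask question in the comments above: the 64-bit literal does also work when w is 32 bits, because w is converted to the wider type of the literal and its upper bits are zero. A width-independent way to write it might look like the sketch below; high_bit_mask is a hypothetical name, and the sketch assumes an unsigned type whose width is a multiple of 8 bits and for which std::numeric_limits is specialised.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: a value of unsigned type T with 0x80 in every byte,
// e.g. 0x80808080 for 32 bits and 0x8080808080808080 for 64 bits.
template <typename T>
constexpr T high_bit_mask()
{
  // max()/0xff is 0x0101...01; multiplying by 0x80 moves each 1 to the top
  // bit of its byte.
  return static_cast<T>(std::numeric_limits<T>::max() / 0xff * 0x80);
}

static_assert(high_bit_mask<std::uint32_t>() == 0x80808080u, "32-bit mask");
static_assert(high_bit_mask<std::uint64_t>() == 0x8080808080808080u, "64-bit mask");
```

The test in the fast path would then read `if (w & high_bit_mask<decltype(w)>()) break;`.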

