Subject: Re: [boost] [beast] Formal review
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2017-07-11 13:40:07
Vinnie Falco wrote:
>>> The reinterpret_cast<> can be trivially changed to std::memcpy:
>>> ...
>> Yes, I believe that's the right thing to do.
>
> That hurts 32-bit ARM.
I think that's an issue with whatever compiler you're using, not the
architecture; I've just done a quick test with arm-linux-gnueabihf-g++-6
6.3.0 and I get about a 5% speedup by using memcpy.
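For reference, a minimal sketch of the memcpy form of a word load, as opposed to dereferencing a reinterpret_cast pointer. The helper name word_has_non_ascii and the choice of std::size_t as the word type are assumptions made for illustration, not Beast's code. The mask is computed from the word width rather than written out as a literal.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from Beast): returns true if any of the
// sizeof(std::size_t) bytes starting at p has its top bit set.
// The caller must guarantee that many bytes are readable.
inline bool word_has_non_ascii(const char* p)
{
    std::size_t w;
    // memcpy is well-defined for any alignment and avoids the aliasing
    // problem of reading through a reinterpret_cast'ed pointer.
    std::memcpy(&w, p, sizeof(w));
    // ~0 / 0xFF is 0x0101...01; times 0x80 gives 0x8080...80 for
    // whatever width std::size_t happens to have.
    const std::size_t mask = ~std::size_t(0) / 0xFF * 0x80;
    return (w & mask) != 0;
}
```

Whether the compiler turns the memcpy into a single load depends on the compiler and target, so any speed difference needs to be measured rather than assumed.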
> There's just an eensy teensy problem, the Beast validator is an
> "online" algorithm. It works with chunks of the entire input sequence
> at a time, sequentially, so there could be a code point that is split
> across the buffer boundary.
Yes, I did notice that but it wasn't clear that it was actually being
used.
> I admit that there is a surprisingly large amount of code required just
> to handle this case.
The following code is totally untested.
#include <stdint.h>  // uint8_t, uintptr_t
#include <string.h>  // memcpy

template <typename ITER>
bool is_valid_utf8(ITER i, ITER end, uint8_t& pending)
{
  // Check if range is valid and complete UTF-8.
  // pending is used to carry state about an incomplete multi-byte character
  // from one call to the next.  It should be zero initially and is zero on
  // return if the input is not mid-character.  After submitting the last
  // chunk the caller should check both the return value and pending==0.

  // Skip bytes pending from last buffer.
  // The number of 1s at the most significant end of the first byte of a
  // multi-byte character indicates the total number of bytes in the
  // character.  pending is this byte, shifted to allow for the number of
  // bytes already seen.
  while (pending & 0x80) {
    // Test for end of input before dereferencing, so that an empty chunk is
    // safe; pending keeps its top bit set and carries over to the next call.
    if (i == end) return true;
    uint8_t b = *i++;
    if ((b & 0xc0) != 0x80) return false;  // Must be a 10xxxxxx continuation byte.
    pending = pending<<1;
  }
  pending = 0;  // Not mid-character; discard any leftover low bits.

  while (i != end) {

    // If i is suitably aligned, do a fast word-at-a-time check for ASCII characters.
    // FIXME this only works if ITER is a contiguous iterator; it needs a "static if".
    const char* p = &(*i);
    const char* e = p + (end-i);  // I don't think &(*end) is allowed because it appears to dereference end.
    unsigned long int w;  // Should be 32 bits on 32-bit processor and 64 bits on 64-bit processor
                          // (but it is 32 bits on 64-bit Windows).
    if (reinterpret_cast<uintptr_t>(p) % sizeof(w) == 0) {
      while (p+sizeof(w) <= e) {
        memcpy(&w,p,sizeof(w));
        // If any of the top bits are set, fall back to the byte-at-a-time code below.
        // (Is there a better way to write the mask value that would work
        // for e.g. 128-bit ints?  Is that expression OK for 32-bit ints?)
        if (w & 0x8080808080808080) break;
        p += sizeof(w);
        i += sizeof(w);
      }
      if (p == e) break;
    }

    uint8_t b0 = *i++;
    if ((b0 & 0x80) == 0) continue;         // Single byte chars are 0xxxxxxx
    if ((b0 & 0xc0) == 0x80) return false;  // 10xxxxxx not allowed as first byte of character
    if ((b0 & 0xf8) == 0xf8) return false;  // 11111xxx is not valid
    // At this point, we know b0 is a valid first-byte

    if (i == end) {       // Incomplete input
      pending = b0 << 1;  // 1 byte seen so far, rest are pending.
      return true;
    }
    uint8_t b1 = *i++;
    if ((b1 & 0xc0) != 0x80) return false;  // Following bytes are all 10xxxxxx
    if ((b0 & 0xe0) == 0xc0) continue;      // Two-byte chars start 110xxxxx

    if (i == end) {       // Incomplete input
      pending = b0 << 2;  // 2 bytes seen so far, rest are pending
      return true;
    }
    uint8_t b2 = *i++;
    if ((b2 & 0xc0) != 0x80) return false;  // Following bytes are all 10xxxxxx
    if ((b0 & 0xf0) == 0xe0) continue;      // Three-byte chars start 1110xxxx

    if (i == end) {       // Incomplete input
      pending = b0 << 3;  // 3 bytes seen so far, rest are pending
      return true;
    }
    uint8_t b3 = *i++;
    if ((b3 & 0xc0) != 0x80) return false;  // Following bytes are all 10xxxxxx
    if ((b0 & 0xf8) == 0xf0) continue;      // Four-byte chars start 11110xxx

    return false;  // Not reached, I think.
  }

  return true;
}
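
To illustrate how pending is meant to be carried from one buffer to the next, here is a small sketch of a caller. To keep it self-contained it uses a simplified byte-at-a-time stand-in with the same pending convention as the code above and no word-at-a-time fast path. The names is_valid_utf8_bytes and validate_chunks are hypothetical. Like the code above, it checks only the lead-byte and continuation-byte structure.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in (hypothetical name) using the same 'pending'
// convention: pending is the lead byte shifted left once per byte seen,
// so its top bit is set while continuation bytes are still expected.
inline bool is_valid_utf8_bytes(const char* i, const char* end, std::uint8_t& pending)
{
    for (; i != end; ++i) {
        const std::uint8_t b = static_cast<std::uint8_t>(*i);
        if (pending & 0x80) {                         // mid-character
            if ((b & 0xc0) != 0x80) return false;     // must be 10xxxxxx
            pending = static_cast<std::uint8_t>(pending << 1);
            continue;
        }
        pending = 0;
        if ((b & 0x80) == 0) continue;                // ASCII
        if ((b & 0xc0) == 0x80) return false;         // stray continuation byte
        if ((b & 0xf8) == 0xf8) return false;         // 11111xxx is not valid
        pending = static_cast<std::uint8_t>(b << 1);  // lead byte seen, rest pending
    }
    if (!(pending & 0x80)) pending = 0;               // last character was complete
    return true;
}

// Hypothetical driver: feed the chunks in order, then check that the
// input did not stop in the middle of a character.
inline bool validate_chunks(const std::vector<std::string>& chunks)
{
    std::uint8_t pending = 0;                         // zero before the first chunk
    for (const std::string& c : chunks)
        if (!is_valid_utf8_bytes(c.data(), c.data() + c.size(), pending))
            return false;
    return pending == 0;
}
```

With this, a two-byte character split as "caf\xC3" followed by "\xA9" is accepted, while the same input without the second chunk is rejected by the final pending == 0 check.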