
Subject: Re: [boost] [Beast] Questions Before Review
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2017-06-26 02:12:25


On 26/06/2017 00:54, Vinnie Falco via Boost wrote:
> On Sun, Jun 25, 2017 at 4:42 PM, Niall Douglas via Boost
> <boost_at_[hidden]> wrote:
>
> Rarely I find the need to quote myself, this is one such occasion:
>
>> Using gsl::span<> effectively type-erases the buffer sequence which is
>> inefficient, since it must be copied to a vector equivalent in order
>> to be used.
>
>> Think gsl::span<gsl::span<char>>.
>
> gsl::span<gsl::span<char>> effectively requires the callers to set
> aside storage for an array of std::pair<void const*, std::size_t>. The
> Asio equivalent is std::vector<boost::asio::const_buffer> which as you
> can see is inefficient as std::vector is not cheap to copy.
>
> Asio's ConstBufferSequence concept of course includes
> gsl::span<gsl::span<char>> and std::vector<boost::asio::const_buffer>
> as models but it also includes the types I provided which cannot be
> modeled using gsl::span without intermediate storage. If you feel this
> is an acceptable tradeoff for your library, I am quite happy for you
> but I'll stick with the use of templates and the ConstBufferSequence
> concept since that is what is on track for standardization.

AFIO v2 doesn't allocate memory except exactly once per async i/o
initiated. So the scatter-gather buffer list is handed to the OS
immediately, and therefore no copies of that list are needed, as the OS
takes it straight away. I should imagine most users would therefore
build scatter-gather lists on the stack, since they are thrown away
immediately; indeed I usually feed it curly-braced initializer_lists
myself, as that is the least typing.

In BEAST's case, I can see that scatter-gather buffers may need to be
kept for longer. Except ...

>> I'd have thought that as the HTTP comes in - in chunks -
>> your code would work with discontiguous storage.
>
> I thought that too and built quite a bit on top of a parser that
> worked in chunks, until I used a profiler and did some comparisons to
> other implementations...
>
>> Are you memory copying instead?
>
> ...when I discovered it is actually much, much faster to memcpy
> discontiguous buffers into a new, allocated buffer so that the parser
> can assume the entire structured portion is in a single "span" (to use
> the terminology you prefer). And of course if the caller already has
> the data in a single buffer then there's no need to memcpy. The class
> beast::flat_buffer is provided for that purpose.

... you're using memcpy to append incoming chunks of TCP data anyway,
obviating the need for scatter-gather.

Now I personally find it quite surprising that you found memory copying
to be faster than a high-quality discontiguous-storage iterator
implementation with ring-buffered, page-aligned, DMA-friendly TCP i/o.
Every time anyone has claimed that to me before, when I examined their
code I found they had been using a naive RandomAccessIterator
implementation doing two pointer indirections per access, which is
obviously not going to be quick. What they should have done was
implement BidirectionalIterator only, and have it cache when it next
needs to do a lookup into the next discontiguous section, thus becoming
very fast indeed because we're not thrashing the CPU caches and not
stalling out-of-order execution. This is the kind of thing you need if
you want a single CPU core feeding a >= 10 Gbps NIC.

The question then becomes how expensive losing random access is for
your parser implementation. Cloning these BidirectionalIterators is
extremely quick (three pointers each), plus parsing HTTP tends to work
in offsets anyway. In any XML parser I've ever written I keep a stack of
where I am in the tag hierarchy as I run forwards - so I would reckon a
ForwardIterator would actually be enough in practice.

But I'll grant you that HTML is not XML; tag soup is much harder to
parse. So you may be right. Nevertheless, I think you are going to have
to back up your claim that memory copying all incoming data is faster,
rather than it being down to bad implementation techniques with
discontiguous storage, as surely I am not alone here in being surprised
at such a claim.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk