
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2024-07-19 16:29:01


On 19/07/2024 17:12, Christian Mazakas via Boost wrote:
> On Thu, Jul 18, 2024 at 2:47 PM Niall Douglas via Boost <
> boost_at_[hidden]> wrote:
>
>> Instead of over-allocating and wasting a page, I would put the link
>> pointers at the end and slightly reduce the maximum size of the i/o
>> buffer. This kinda is annoying to look at because the max buffer fill is
>> no longer a power of two, but in terms of efficiency it's the right call.
>>
>
> Hey, this is actually a good idea. I had similar thoughts when I was
> designing it.
>
> I can give benchmarking it a shot and see what the results are.

It'll be a bit faster due to reduced TLB pressure :)
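
For concreteness, a minimal sketch of that end-of-page layout might look
like this (names and the 4 KiB page size are assumptions for
illustration, not taken from any real code):

#include <cstddef>

// Sketch only: the intrusive link pointers live at the end of a
// page-sized object, so nothing over-allocates; the cost is that the
// maximum fill is a page minus the links rather than a power of two.
struct alignas(4096) io_buffer
{
    static constexpr std::size_t page_size = 4096;

    struct links
    {
        io_buffer *next{nullptr};
        io_buffer *prev{nullptr};
    };

    // 4096 - 16 = 4080 bytes: not a power of two, but data + links
    // occupy exactly one page
    static constexpr std::size_t max_fill = page_size - sizeof(links);

    unsigned char data[max_fill];
    links link;
};

static_assert(sizeof(io_buffer) == io_buffer::page_size);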

> What kind of benchmark do you think would be the best test here? I
> suppose one thing I should try is a multishot recv benchmark with many
> small buffers and a large amount of traffic to send. Probably just max
> out the size of a buf_ring, which is only like 32k buffers anyway.
>
> Ooh, we can even try page-aligning the buffers too.

The first one I always start with is "how much bandwidth can I transfer
using a single kernel thread?"

The second one is how small a write quantum I can use while still
maxing out bandwidth from a single kernel thread.

It's not dissimilar to tuning for file i/o: there is a bandwidth-latency
tradeoff, and latency is proportional to the i/o quantum. If you can get
the i/o quantum down without overly affecting bandwidth, that has hugely
beneficial effects on i/o latency, particularly in terms of a nice,
flat-ish latency distribution.
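
As an illustration of that second benchmark, a sweep over write quanta
could be shaped roughly like this; send_quantum() is a hypothetical
stand-in for whatever single-kernel-thread io_uring submission path is
under test:

#include <chrono>
#include <cstddef>
#include <cstdio>

// Assumed placeholder: submits one write of `bytes` and returns the
// number of bytes actually sent.
extern std::size_t send_quantum(std::size_t bytes);

void sweep_quanta()
{
    using clock = std::chrono::steady_clock;
    // double the write quantum each step and record bandwidth per point
    for (std::size_t quantum = 256; quantum <= (std::size_t(1) << 20);
         quantum <<= 1)
    {
        std::size_t total = 0;
        auto begin = clock::now();
        while (total < (std::size_t(1) << 30))   // ~1 GiB per data point
            total += send_quantum(quantum);
        std::chrono::duration<double> secs = clock::now() - begin;
        std::printf("quantum %8zu  bandwidth %.1f MiB/s\n", quantum,
                    double(total) / secs.count() / (1024.0 * 1024.0));
    }
}

The interesting number is the smallest quantum at which the bandwidth
column stops improving.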

>> Surely for reading you want io_uring to tell you the buffers, and when
>> you're done, you immediately push them back to io_uring? So no need to
>> keep buffer lists except for the write buffers?
>>
>
> You'd think so, but there's no such thing as a free lunch.
>
> When it comes to borrowing the buffers, to do any meaningful work you'll
> have to either allocate and memcpy the incoming buffers so you can then
> immediately release them back to the ring or you risk buffer starvation.
>
> This is because not all protocol libraries are designed to copy their
> input from you and they require the caller use stable storage. Beast is
> like this and I think zlib is too. There's no guarantee across protocol
> libraries that they'll reliably copy your input for you.
>
> The scheme I chose is one where users own the returned buffer sequence
> and this enables nice things like an in-place TLS decryption, which I
> use via Botan. This reminds me, I use Botan in order to provide a
> generally much stronger TLS interface than Asio's.
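
For illustration, that copy-then-recycle pattern might look roughly like
this; recycle_to_ring() is a placeholder name, not a liburing API:

#include <cstddef>
#include <cstring>
#include <vector>

// Placeholder for whatever call replenishes the provided-buffer ring.
extern void recycle_to_ring(void *ring_buf);

// Copy a borrowed ring buffer into caller-owned storage and hand the
// slot straight back, so protocol libraries that need stable storage
// (e.g. Beast, zlib as described above) can keep `owned` without
// starving the ring.
std::vector<unsigned char> copy_and_recycle(void *ring_buf, std::size_t len)
{
    std::vector<unsigned char> owned(len);
    std::memcpy(owned.data(), ring_buf, len);
    recycle_to_ring(ring_buf);
    return owned;
}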

Oh okay. io_uring permits 4096 locked i/o buffers per ring. I put
together a bit of C++ metaprogramming which encourages users to release
i/o buffers as soon as possible, but if they really want to hang onto a
buffer, they can.
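
One hypothetical shape for that "encourage early release" idea is a
move-only RAII handle which returns the buffer on destruction, so
holding onto it longer has to be an explicit choice (sketch only, not
the actual implementation):

#include <utility>

// Placeholder for the real path that returns a buffer to the ring.
extern void return_to_ring(void *buf) noexcept;

// Move-only RAII handle: the buffer goes back automatically when the
// handle dies; keeping the handle alive is how you hang onto a buffer.
class registered_buffer
{
    void *buf_{nullptr};

public:
    explicit registered_buffer(void *buf) noexcept : buf_(buf) {}
    registered_buffer(registered_buffer &&o) noexcept
        : buf_(std::exchange(o.buf_, nullptr)) {}
    registered_buffer &operator=(registered_buffer &&o) noexcept
    {
        if (this != &o)
        {
            release();
            buf_ = std::exchange(o.buf_, nullptr);
        }
        return *this;
    }
    ~registered_buffer() { release(); }

    void *data() const noexcept { return buf_; }

    void release() noexcept
    {
        if (buf_ != nullptr)
            return_to_ring(std::exchange(buf_, nullptr));
    }
};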

If we run out of buffers, I stall new i/o until new buffers appear. I
then have per-op TSC counts so if we spend too much time stalling new
i/o, the culprits holding onto buffers for too long can be easily
identified.
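
The per-op accounting could be sketched roughly as follows; the names
are hypothetical and an x86 build is assumed for __rdtsc():

#include <cstdint>
#include <unordered_map>
#include <x86intrin.h>   // __rdtsc()

struct buffer_accounting
{
    // buffer -> TSC at handout
    std::unordered_map<void *, std::uint64_t> handed_out;
    // op identity -> accumulated cycles spent holding buffers
    std::unordered_map<const void *, std::uint64_t> held_cycles;

    void on_acquire(void *buf) { handed_out[buf] = __rdtsc(); }

    void on_release(void *buf, const void *op_id)
    {
        held_cycles[op_id] += __rdtsc() - handed_out[buf];
        handed_out.erase(buf);
    }
};

Operations with outsized held_cycles totals are the ones holding buffers
for too long while everyone else stalls.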

I reckon this is the least worst of the approaches before us - well-behaved
code gets maximum performance, less well-behaved code gets less
performance. But everything is reliable.

If you think this model through, the requirement for the most efficient
implementation is that all work must always be suspend-resumable,
because any work can be suspended at any time due to a temporary lack of
resources. In other words, completion callbacks won't cut it here.
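
A minimal sketch of what suspend-resumable means in practice, using
C++20 coroutines and entirely hypothetical names: buffer acquisition is
itself an awaitable, so any operation can pause when the pool is empty
and be resumed when another operation returns a buffer.

#include <coroutine>
#include <cstdio>
#include <deque>
#include <vector>

struct buffer_pool
{
    std::vector<void *> free_buffers;             // currently available
    std::deque<std::coroutine_handle<>> waiters;  // ops stalled on acquire

    void release(void *buf)
    {
        free_buffers.push_back(buf);
        if (!waiters.empty())                     // wake one stalled op
        {
            auto h = waiters.front();
            waiters.pop_front();
            h.resume();
        }
    }
};

struct acquire_buffer
{
    buffer_pool &pool;

    bool await_ready() const noexcept { return !pool.free_buffers.empty(); }
    void await_suspend(std::coroutine_handle<> h) { pool.waiters.push_back(h); }
    void *await_resume()
    {
        void *buf = pool.free_buffers.back();
        pool.free_buffers.pop_back();
        return buf;
    }
};

// Minimal detached coroutine type, just enough to show the shape.
struct fire_and_forget
{
    struct promise_type
    {
        fire_and_forget get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };
};

fire_and_forget do_io(buffer_pool &pool)
{
    void *buf = co_await acquire_buffer{pool};  // may stall here without blocking a thread
    std::printf("filling buffer %p\n", buf);
    pool.release(buf);                          // may resume another stalled op
}

A completion callback cannot be parked halfway through like this without
hand-rolling an explicit state machine, which is the point being made
above.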

Niall

