Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-29 16:18:35


On Sun, Jan 30, 2011 at 4:39 AM, Steven Watanabe <watanabesj_at_[hidden]> wrote:
> AMDG
>
> On 1/29/2011 1:25 AM, Dean Michael Berris wrote:
>>
>> On Sat, Jan 29, 2011 at 6:53 AM, Steven Watanabe<watanabesj_at_[hidden]>
>>  wrote:
>>>
>>> This allocator gets more memory from the OS as needed,
>>> but usually doesn't release it back to the OS immediately.
>>
>> On Sun the implementation of malloc/free are like this. But on Linux
>> and other OSes as far as I can tell a call to free will actually tell
>> the system memory manager that the segment(s) or page(s) in question
>> are available.
>>
>
> Wrong.  It will only do so if it can, which is
> definitely not all the time.  If you try a
> naive benchmark, it probably will look like
> it always releases the memory.
>

I wasn't implying that it will release the memory all the time, only
that it will do so when the segment is actually unused.

>>> * If sbrk fails, use mmap to allocate a 1 MB chunk.
>>>
>>
>> And there we go.
>>
>
> ???
> 1 MB is usually a lot bigger than a page.  The allocator
> itself is asking the system for chunks far bigger than the
> sizes that you seem to be so worried about.  I don't
> understand what you're trying to point out here.
>

Right. I don't know what I was thinking.

>>> Note that the OS never gets a mmap request less than
>>> 128 KB and all the memory allocated via sbrk is always
>>> contiguous.
>>>
>>
>> Yes, and this is the problem with sbrk: if you rely on your allocator
>> to use sbrk, sbrk is not guaranteed to just expand the available
>> segment in place
>
> Yes it is.  Its interface makes it impossible for it
> to do anything else.
>
>       #include <unistd.h>
>
>       int brk(void *addr);
>       void *sbrk(intptr_t increment);
>
> brk()  and  sbrk()  change  the  location  of  the program break,
> which defines the end of the process's data segment (i.e., the program
> break is the first location after the end of the uninitialized data
> segment).
>

You're right, I'm wrong.

>
>> -- which means the VMM no matter what you do will
>> actually have to find the memory for you and give you that chunk of
>> memory. This means potentially swapping pages in/out. You limit this
>> likelihood by asking for page-sized and page-aligned chunks and using
>> those properly;
>
> Why?
>

Why do page-sized/page-aligned chunks reduce the likelihood of
(unnecessarily) swapping pages in and out? Because if you don't take
care, your data spills over a page boundary and the number of pages
you're touching is one more than you actually need to touch.
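
To make the arithmetic concrete, a tiny sketch (the 4 KB page size is
just an assumption for the example; query the real one at runtime):

    #include <cstddef>
    #include <cstdio>

    // Pages touched by a buffer of `size` bytes that starts `offset` bytes
    // past a page boundary. 4 KB is assumed only for illustration; on POSIX
    // you'd ask sysconf(_SC_PAGESIZE) instead.
    std::size_t pages_touched(std::size_t offset, std::size_t size,
                              std::size_t page = 4096)
    {
        return (offset + size + page - 1) / page;
    }

    int main()
    {
        std::printf("%lu\n", (unsigned long)pages_touched(0, 4096));    // aligned 4 KB buffer: 1 page
        std::printf("%lu\n", (unsigned long)pages_touched(64, 4096));   // same buffer 64 bytes off: 2 pages
        std::printf("%lu\n", (unsigned long)pages_touched(2048, 4096)); // half a page off: 2 pages
        return 0;
    }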

>> making data in these chunks immutable makes for much
>> more sane "shareability" and a much more predictable performance
>> profile.
>>
>
> This is irrelevant to the discussion at hand.
> I don't want to get into anything except the
> effects of allocating contiguous buffers, since
> that seems to be quite enough.
>

Ok.

>> The problem is then compounded by when your program calls sbrk/brk in
>> one core and fails -- then mmap is called and the memory returned by
>> mmap will actually be in the memory module that's handled by a NUMA
>> CPU. Now imagine using these chunks from another thread that's not
>> running on the same core and you go downhill from there. That doesn't
>> even factor the cache hits/misses that can kill your performance
>> especially if the VMM has to page in/out memory that has been evicted
>> from the L3 cache of the processors (if there's even an L3).
>>
>
> irrelevant.
>

Ok.

>> All of this stems from the fact that you cannot rely on there being
>> enough memory at the start of the program at any given point. Thinking
>> about it differently helps this way: if you assume that every memory
>> allocation you make will cause a page fault and will potentially cause
>> the VMM to allocate you a page, then you should be using this page of
>> memory properly and saving on the amount of page faults you generate.
>
> This assumption doesn't necessarily hold.  In fact, the library's
> allocator probably tries to avoid it.  Anyway, even with
> this assumption we have a single 5 page allocation -> 5
> page faults, 5 separate 1 page allocations -> 5 page faults.
> How does allocating only a page at a time save you anything?
>

If you don't take care to allocate right-sized chunks, the memory you
allocate may spill over onto more than one page. That means you touch
more pages than you want and cause one more page fault than necessary.

>>
>> It is if you have 10,000 concurrent connections and you have to
>> allocate somewhere around that same amount of data for each
>> connection's "buffer".
>>
>
> So why is it a problem if the buffers are
> contiguous?

It's not, if it fits in one page or in the minimum number of pages. You
limit the page faults and thus the paging.

> I don't see why the memory
> manager can't handle this in a reasonably
> efficient way.
>

If you touch two pages instead of one, you're that much more likely to
take twice as many page faults.

>>
>> Yes, but the thing is, you don't have to get to that point for your
>> VMM to start swapping things in/out of memory. Especially on NUMA
>> architectures, how Linux behaves is pretty surprising, and that's
>> another can of worms you don't wanna get into. ;)
>>
>
> Of course, but my point was that there
> will always be a sufficiently large
> contiguous address range.
>

Okay.

>>
>> It actually is. Note that there's a hardware page size and a virtual
>> page size. Bus bandwidths matter as well regarding how much data you
>> can actually pull from main memory into a CPU's cache. These factors
>> put a predictable upper bound to what you can do efficiently on a
>> given machine. Of course these are things you cannot control a priori
>> in your C++ application. This also is a matter of how well your OS
>> does virtual memory paging.
>>
>
> You still have to allocate just as much memory whether
> it's contiguous or not.  The page size is not directly
> correlated to what the allocator can handle efficiently.
>

Yes, but if your allocations don't play along with the page size you
could be touching more pages than you need to. Say you have a std::list
of things and each node straddles a page boundary; then touching each
node causes two page faults instead of just one.
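
A rough way to see this on a real system (POSIX-only sketch; it only
looks at the element storage, not the node headers, so it undercounts a
little):

    #include <cstdio>
    #include <list>
    #include <set>
    #include <stdint.h>   // uintptr_t
    #include <unistd.h>   // sysconf

    int main()
    {
        const uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);

        std::list<long> l;
        for (int i = 0; i < 1000; ++i)
            l.push_back(i);

        // Count the distinct pages the elements' storage lands on.
        std::set<uintptr_t> pages;
        for (std::list<long>::const_iterator it = l.begin(); it != l.end(); ++it) {
            uintptr_t a = (uintptr_t)&*it;
            pages.insert(a / page);                       // page of the first byte
            pages.insert((a + sizeof(long) - 1) / page);  // page of the last byte
        }

        std::printf("%lu nodes touch %lu pages\n",
                    (unsigned long)l.size(), (unsigned long)pages.size());
        return 0;
    }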

>> BUT... what you can control is how you ask for memory from the system.
>> This means if you can ask for a page-aligned and page-sized worth of
>> memory -- by knowing sufficiently well enough how big a "page" is at
>> runtime or a-priori -- then that's better memory usage all around
>> playing well with the VMM and your caches.
>>
>
> You keep asserting that this is better, but you
> still haven't explained why except from a flawed
> understanding of how the memory manager works.  If
> you just use malloc or new, allocating 4096 bytes
> or whatever the page size is, will probably not give
> you what you want, because the memory manager adds
> its own header, so the actual memory block
> will take up slightly more than a page (In
> an implementation dependent way).
>

It's really simple. If your allocator is stupid, then you can't do
much about that, right?

But if your allocator is smart and you ask for memory in chunk sizes
that are equal to or close to the page size, it can give you a
page-aligned address. That makes a world of difference in limiting the
number of potential page faults, compared to getting the data at a
non-page-aligned address inside a larger-than-a-page chunk.

Of course if you really wanted contiguous data then by all means go
get a contiguous block of N pages. But that's not the case I'm talking
about because I'm interested in the case when I don't *need* a
contiguous block of pages.
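
For the single-page case, on POSIX you don't even have to hope the
allocator cooperates; something like this (just a sketch) hands back a
page-sized, page-aligned chunk directly:

    #include <cstdio>
    #include <stdlib.h>   // posix_memalign, free
    #include <unistd.h>   // sysconf

    int main()
    {
        const size_t page = (size_t)sysconf(_SC_PAGESIZE);

        // One page, aligned to a page boundary, so the chunk never straddles
        // two pages. A smart allocator could do the equivalent internally.
        void* p = 0;
        if (posix_memalign(&p, page, page) != 0)
            return 1;

        std::printf("page size %lu, chunk at %p\n", (unsigned long)page, p);
        free(p);
        return 0;
    }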

>>
>> Right, so given this context, you're saying that using a sane memory
>> model wouldn't help?
>>
>
> I disagree with what you consider sane.
>

Alright, fair enough.

>>
>> This is the hidden cost of mutability in these situations, and how
>> asking for contiguous virtual memory pages will kill your performance.
>>
>
> What does this have to do with contiguity?
>

Cross-page access at page boundaries potentially causes more page
faults than are actually necessary.

>> Try a simple benchmark: malloc a 1GB randomly generated file into a
>> 64-bit multi-core Intel nehalem based processor, spawn 1,000 threads
>> on Linux and randomly on each thread change a value in a randomly
>> picked offset. This will show you how cache misses, VMM paging, and
>> the cost of mutability of data will actually kill your efficiency. Now
>> if you change the program to just randomly read from a part of memory,
>> you'll see how immutability helps. Then change it yet again to malloc
>> 1GB worth of data but this time on multiple files breaking them files
>> up into smaller pieces. The dramatic improvement in performance should
>> be evident there. :)
>>
>
> a) Why use a file, instead of simply allocating that much memory?

Because mmap does demand paging by default: a page fault doesn't
happen until you actually access the memory. That gives you an idea of
how the page faults actually affect the solution.
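
A quick way to watch that happen (Linux/POSIX sketch; the fault counts
come from getrusage and will include a little unrelated noise):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Minor + major page faults taken by this process so far.
    static long faults()
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt + ru.ru_majflt;
    }

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);

        long before = faults();
        void* p = mmap(0, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;
        long after_map = faults();

        // Touch one byte per page; only now do the faults show up.
        const long page = sysconf(_SC_PAGESIZE);
        const char* data = (const char*)p;
        volatile char sink = 0;
        for (off_t off = 0; off < st.st_size; off += page)
            sink += data[off];
        long after_touch = faults();

        std::printf("faults around mmap itself: %ld, from touching the pages: %ld\n",
                    after_map - before, after_touch - after_map);

        munmap(p, (size_t)st.st_size);
        close(fd);
        (void)sink;
        return 0;
    }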

> b) Intentionally bad memory usage is not a good benchmark.

Random access to a huge chunk of memory is not intentionally bad
memory usage -- that's a pathological case for caches and storage
services like RDBMS etc.

> c) I might actually try it if you provide source code so
>   I can be sure that we're talking about the same thing.
>

I just might do that later on. ;)
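
In the meantime, here is roughly the shape of what I have in mind (very
much a sketch: an anonymous mapping stands in for the file, the sizes
and thread count are placeholders, and it uses plain pthreads, so
compile with -pthread):

    #include <cstdio>
    #include <pthread.h>
    #include <stdlib.h>     // rand_r
    #include <sys/mman.h>

    static const size_t kSize    = 1ul << 30;  // 1 GB region (placeholder)
    static const int    kThreads = 64;         // placeholder, not the full 1,000
    static const long   kIters   = 1000000;    // touches per thread
    static char*        g_buf    = 0;

    // Each thread hammers random offsets. The write is the "mutate" variant;
    // swap it for "sink += g_buf[off];" to get the read-only variant.
    static void* churn(void* arg)
    {
        unsigned seed = (unsigned)(size_t)arg;
        volatile char sink = 0;
        for (long i = 0; i < kIters; ++i) {
            size_t off = (size_t)rand_r(&seed) % kSize;
            g_buf[off] = (char)i;
        }
        (void)sink;
        return 0;
    }

    int main()
    {
        g_buf = (char*)mmap(0, kSize, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (g_buf == MAP_FAILED) { std::perror("mmap"); return 1; }

        pthread_t tids[kThreads];
        for (int t = 0; t < kThreads; ++t)
            pthread_create(&tids[t], 0, churn, (void*)(size_t)(t + 1));
        for (int t = 0; t < kThreads; ++t)
            pthread_join(tids[t], 0);

        munmap(g_buf, kSize);
        return 0;
    }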

>>
>> The paper does assume an allocator that knows how to use pages wisely.
>
> What do you mean by "use pages wisely?"  It isn't
> just a matter of having a reasonable allocator.  You
> would need an allocator specifically tuned for your
> usage.
>

Use pages wisely: lay out a tree, for example, on a single page of
memory or in as few pages as possible. Put chunks of data in a
page-aligned block. Things like that.

If you can tell the allocator to give you a page-aligned or page-sized
chunk of memory, then you can do the right thing with the data and
limit the chances of causing unnecessary page faults.
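
To make the tree example concrete, a toy sketch (POSIX-only, and it
deliberately does nothing clever like freeing individual nodes or
rounding up for alignment): carve all your nodes out of one
page-aligned page, so walking the structure touches exactly one page.

    #include <cstddef>
    #include <new>        // std::bad_alloc, placement new
    #include <stdlib.h>   // posix_memalign, free
    #include <unistd.h>   // sysconf

    // Toy "one page, many nodes" pool. Real code would also round each
    // allocation up to the node type's alignment.
    class page_pool {
    public:
        page_pool()
            : page_size_((std::size_t)sysconf(_SC_PAGESIZE)), used_(0), base_(0)
        {
            if (posix_memalign(&base_, page_size_, page_size_) != 0)
                throw std::bad_alloc();
        }

        ~page_pool() { free(base_); }

        void* allocate(std::size_t n)
        {
            if (used_ + n > page_size_)
                throw std::bad_alloc();   // toy pool: one page and no more
            void* p = (char*)base_ + used_;
            used_ += n;
            return p;
        }

    private:
        std::size_t page_size_;
        std::size_t used_;
        void*       base_;
    };

    struct node {
        long  value;
        node* left;
        node* right;
    };

    int main()
    {
        page_pool pool;
        // Every node lands on the same page, so traversing the tree can only
        // ever fault that one page in.
        node* root = new (pool.allocate(sizeof(node))) node();
        root->value = 42;
        root->left = root->right = 0;
        return 0;
    }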

>> Vicente Botet has a nice allocplus proposal which IMO is a good/better
>> way to go than the currently constrained allocators defined in C++03.
>> Maybe that will make it to C++0x, I haven't checked yet.
>>
>
> You mean Ion, I think.  Anyway, I don't see how this
> is relevant.
>

Yes, I meant Ion. Sorry about that.

If that allocator knew how big a page is at runtime and allowed
page-aligned chunks to be reserved as a "free store", managed with some
sort of bin-packing heuristic, then it could limit the page faults
induced when users actually access memory in the areas it hands out.

>>
>> Sure, so how do you discourage that from being encouraged if all the
>> data structures you can share across threads are mutable?
>>
>
> If you want to use a pure functional language,
> go ahead.  I do not believe in restricting features
> because they /can/ be abused under /some/
> circumstances.  Strings don't have to be shared
> across threads.  Besides, from what I saw, your
> string isn't really immutable.  It just doesn't
> allow mutable access to characters.
>

There are many things that don't have to be shared across threads but
unfortunately do get shared.

And no, I don't want to use a pure functional language.

If the type doesn't allow mutable access to its characters, how is it
not an immutable string?

>>> There's nothing special about strings.
>>> If you want to share immutable data, declare it
>>> const.  Problem solved.  No need to force all
>>> strings to be immutable just for this.
>>>
>>
>> const has not prevented anybody from const_cast<...>'ing anything.
>> It's not sufficient that your immutability guarantee only comes from a
>> language feature that can even be overridden.
>
> Sorry, but I have little sympathy for those
> who abuse const_cast.
>

You and I both. But then it's a language feature.
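
Just so we're talking about the same thing, this is the kind of hole I
mean (a contrived sketch, of course):

    #include <string>

    void sneak(const std::string& s)
    {
        // Perfectly legal to write; undefined behaviour only if the
        // original object was itself const. const is a promise to the
        // caller, not an enforcement mechanism.
        std::string& w = const_cast<std::string&>(s);
        w += "!";
    }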

>> And even if an object is
>> const, an implementor of that object's type has the mutable keyword
>> giving you a false sense of security in making calls even to const
>> member functions of that type.
>>
>
> Well don't do that then.  Declaring that a
> type is immutable doesn't make you any safer
> against this.
>

Which isn't what I'm going to do, you're right. But I was pointing out
that nothing stops anybody from const_cast'ing things and making them
mutable.

>> Unfortunately as much as I would like to agree with the statement
>> "there's nothing special about strings", a large body of computer
>> science algorithms dealing with strings primarily (think compression
>> algorithms, distance algorithms, regex matching, parsing, etc.) show
>> otherwise. Strings are an appropriate abstraction for some kinds of
>> data and this is especially true if you're dealing with text-based
>> network protocols and user-generated data.
>>
>
> Huh?  Why does that mean that strings should
> be treated differently from everything else?
>

Because there are special cases and algorithms that only make sense for strings?

-- 
Dean Michael Berris
about.me/deanberris
