Subject: Re: [boost] [string] proposal
From: Steven Watanabe (watanabesj_at_[hidden])
Date: 2011-01-29 15:39:44
On 1/29/2011 1:25 AM, Dean Michael Berris wrote:
> On Sat, Jan 29, 2011 at 6:53 AM, Steven Watanabe<watanabesj_at_[hidden]> wrote:
>> This allocator gets more memory from the OS as needed,
>> but usually doesn't release it back to the OS immediately.
> On Sun the implementation of malloc/free are like this. But on Linux
> and other OSes as far as I can tell a call to free will actually tell
> the system memory manager that the segment(s) or page(s) in question
> are available.
Wrong. It will only do so if it can, which is
definitely not all the time. If you try a
naive benchmark, it probably will look like
it always releases the memory.
>> In ptmalloc2 which is used be glibc, the default settings
>> work something like this:
>> * If the requested size is at least 128 KB use mmap
>> to let the OS handle the request directly.
>> * If there is a block available that's big enough, use it.
>> * try to use sbrk to expand the heap.
>> * If sbrk fails, use mmap to allocate a 1 MB chunk.
> And there we go.
1 MB is usually a lot bigger than a page. The allocator
itself is asking the system for chunks far bigger than the
sizes that you seem to be so worried about. I don't
understand what you're trying to point out here.
>> Note that the OS never gets a mmap request less than
>> 128 KB and all the memory allocated via sbrk is always
> Yes, and this is the problem with sbrk: if you rely on your allocator
> to use sbrk, sbrk is not guaranteed to just expand the available
> segment in place
Yes it is. Its interface makes it impossible for it
to do anything else.
int brk(void *addr);
void *sbrk(intptr_t increment);
brk() and sbrk() change the location of the program break,
which defines the end of the process's data segment (i.e., the program
break is the first location after the end of the uninitialized data
> -- which means the VMM no matter what you do will
> actually have to find the memory for you and give you that chunk of
> memory. This means potentially swapping pages in/out. You limit this
> likelihood by asking for page-sized and page-aligned chunks and using
> those properly;
> making data in these chunks immutable makes for much
> more sane "shareability" and a much more predictable performance
This is irrelevant to the discussion at hand.
I don't want to get into anything except the
effects of allocating contiguous buffers, since
that seems to be quite enough.
> The problem is then compounded by when your program calls sbrk/brk in
> one core and fails -- then mmap is called and the memory returned by
> mmap will actually be in the memory module that's handled by a NUMA
> CPU. Now imagine using these chunks from another thread that's not
> running on the same core and you go downhill from there. That doesn't
> even factor the cache hits/misses that can kill your performance
> especially if the VMM has to page in/out memory that has been evicted
> from the L3 cache of the processors (if there's even an L3).
> All of this stems from the fact that you cannot rely on there being
> enough memory at the start of the program at any given point. Thinking
> about it differently helps this way: if you assume that every memory
> allocation you make will cause a page fault and will potentially cause
> the VMM to allocate you a page, then you should be using this page of
> memory properly and saving on the amount of page faults you generate.
This assumption doesn't necessarily hold. In fact, the library's
allocator probably tries to avoid it. Anyway, even with
this assumption we have a single 5 page allocation -> 5
page faults, 5 separate 1 page allocations -> 5 page faults.
How does allocating only a page at a time save you anything?
>>>> Humph. 9 KB is not all that much memory.
>>>> Besides, how often do you have strings
>>>> this big anyway?
>>> When you're downloading HTML pages and/or building HTML pages, or
>>> parsing HTTP requests, it happens a lot. 9kb is a number that's
>>> suitable to show that you need multiple contiguous pages for any
>>> meaningful discussion on what memory paging effects have something to
>>> do with strings and algorithms that deal with strings. Of course this
>>> assumes 4kb pages too, the formula would be the same though with just
>>> values changing for any platform. ;)
>> How big are the strings you're really dealing with?
> There's really no way to know up front especially with HTTP traffic
> that's coming in as chunk-encoded bytes. ;)
>> 9 KB is just insignificant even in a 32 (31?)-bit virtual
>> address space.
> It is if you have 10,000 concurrent connections and you have to
> allocate somewhere around that same amount of data for each
> connection's "buffer".
So why is it a problem if the buffers are
contiguous? I don't see why the memory
manager can't handle this in a reasonably
>> As far as I'm concerned, it isn't big
>> until you're talking megabytes, at least.
> We're not just talking about one string -- I'm talking about several
> tens of thousands of these potentially large strings.
>> With 64-bit
>> you can exhaust physical memory and use the entire
>> hard drive for swap and still be an order of magnitude
>> short of using up all the virtual address space, even
>> when the hardware restricts you to 48-bit virtual addresses.
> Yes, but the thing is, you don't have to get to that point for your
> VMM to start swapping things in/out of memory. Especially on NUMA
> architectures, how Linux behaves is pretty surprising, and that's
> another can of worms you don't wanna get into. ;)
Of course, but my point was that there
will always be a sufficiently large
contiguous address range.
>>> But the effect of finding the 5 contiguous pages is what I was trying
>>> to highlight. Even if you could grow the original buffers of one of
>>> the strings, in a std::string it would still have to be a contiguous
>>> chunk of bytes anyway. :)
>> The page size is simply irrelevant W.R.T.
>> the effect of allocating contiguous memory.
> It actually is. Note that there's a hardware page size and a virtual
> page size. Bus bandwidths matter as well regarding how much data you
> can actually pull from main memory into a CPU's cache. These factors
> put a predictable upper bound to what you can do efficiently on a
> given machine. Of course these are things you cannot control a priori
> in your C++ application. This also is a matter of how well your OS
> does virtual memory paging.
You still have to allocate just as much memory whether
it's contiguous or not. The page size is not directly
correlated to what the allocator can handle efficiently.
> BUT... what you can control is how you ask for memory from the system.
> This means if you can ask for a page-aligned and page-sized worth of
> memory -- by knowing sufficiently well enough how big a "page" is at
> runtime or a-priori -- then that's better memory usage all around
> playing well with the VMM and your caches.
You keep asserting that this is better, but you
still haven't explained why except from a flawed
understanding of how the memory manager works. If
you just use malloc or new, allocating 4096 bytes
or whatever the page size is, will probably not give
you what you want, because the memory manager adds
its own header, so the actual memory block
will take up slightly more than a page (In
an implementation dependent way).
>>> If we're going to go through "in my experience most programs" I'd say
>>> this happens pretty often. ;)
>> Well, looking at my system right now, the
>> most memory that any process is using is
>> 420 MB. This is still a lot less than
>> the 2 GB limit. The next memory user
>> isn't even close (180 MB) and after that
>> it drops off rapidly. If you're getting
>> this close to running out of address space,
>> contiguous strings are probably the least
>> of your problems anyway. There's no way
>> the 5 contiguous pages make any noticeable
>> difference whatsoever.
> Right, so given this context, you're saying that using a sane memory
> model wouldn't help?
I disagree with what you consider sane.
>>> The fact that contiguous pages don't get stored in contiguous physical
>>> memory compounds the problem even more.
> On NUMA machine with 4 memory controllers total sitting on 4 cores,
> imagine a page table where 4 contiguous pages of virtual memory are
> mapped across each of these 4 controllers evenly. Now if you're not
> making mutations you're saving yourself the round-trip from the CPU's
> write-through cache to the appropriate memory controller and
> potentially invalidating cached pages on those controllers as well.
> This is how NUMA changes the equation and how contiguous pages give
> you the "illusion" of contiguity.
> This is the hidden cost of mutability in these situations, and how
> asking for contiguous virtual memory pages will kill your performance.
What does this have to do with contiguity?
> Try a simple benchmark: malloc a 1GB randomly generated file into a
> 64-bit multi-core Intel nehalem based processor, spawn 1,000 threads
> on Linux and randomly on each thread change a value in a randomly
> picked offset. This will show you how cache misses, VMM paging, and
> the cost of mutability of data will actually kill your efficiency. Now
> if you change the program to just randomly read from a part of memory,
> you'll see how immutability helps. Then change it yet again to malloc
> 1GB worth of data but this time on multiple files breaking them files
> up into smaller pieces. The dramatic improvement in performance should
> be evident there. :)
a) Why use a file, instead of simply allocating that much memory?
b) Intentionally bad memory usage is not a good benchmark.
c) I might actually try it if you provide source code so
I can be sure that we're talking about the same thing.
>> I understand perfectly well that avoiding swapping
>> helps performance. What I still don't understand
>> is how storing a string with contiguous virtual
>> addresses increases the amount of swapping required.
>> Unless you write your own allocator the odds are that
>> splitting up the string into small chunks will
>> spread it across /more/ pages, increasing your
>> working set.
> The paper does assume an allocator that knows how to use pages wisely.
What do you mean by "use pages wisely?" It isn't
just a matter of having a reasonable allocator. You
would need an allocator specifically tuned for your
> Vicente Botet has a nice allocplus proposal which IMO is a good/better
> way to go than the currently constrained allocators defined in C++03.
> Maybe that will make it to C++0x, I haven't checked yet.
You mean Ion, I think. Anyway, I don't see how this
>>> Right. Now if you prevented this from even being possible by making a
>>> string immutable, I'd say it's a win, no? :)
>> I disagree. Reading and writing the same data
>> from multiple threads in parallel is a bad idea
>> in general.
> Sure, so how do you discourage that from being encouraged if all the
> data structures you can share across threads are mutable?
If you want to use a pure functional language,
go ahead. I do not believe in restricting features
because they /can/ be abused under /some/
circumstances. Strings don't have to be shared
across threads. Besides, from what I saw, your
string isn't really immutable. It just doesn't
allow mutable access to characters.
>> There's nothing special about strings.
>> If you want to share immutable data, declare it
>> const. Problem solved. No need to force all
>> strings to be immutable just for this.
> const has not prevented anybody from const_cast<...>'ing anything.
> It's not sufficient that your immutability guarantee only comes from a
> language feature that can even be overridden.
Sorry, but I have little sympathy for those
who abuse const_cast.
> And even if an object is
> const, an implementor of that object's type has the mutable keyword
> giving you a false sense of security in making calls even to const
> member functions of that type.
Well don't do that then. Declaring that a
type is immutable doesn't make you any safer
> Unfortunately as much as I would like to agree with the statement
> "there's nothing special about strings", a large body of computer
> science algorithms dealing with strings primarily (think compression
> algorithms, distance algorithms, regex matching, parsing, etc.) show
> otherwise. Strings are an appropriate abstraction for some kinds of
> data and this is especially true if you're dealing with text-based
> network protocols and user-generated data.
Huh? Why does that mean that strings should
be treated differently from everything else?
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk