Boost logo

Boost :

Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-29 04:25:58

On Sat, Jan 29, 2011 at 6:53 AM, Steven Watanabe <watanabesj_at_[hidden]> wrote:
> On 1/28/2011 10:22 AM, Dean Michael Berris wrote:
>> On Sat, Jan 29, 2011 at 1:43 AM, Steven Watanabe<watanabesj_at_[hidden]>
>>  wrote:
>>> I'm not sure what this has to do with the OS.
>>> The memory manager exists in user space and
>>> only goes to the OS when it can't fulfill the
>>> request with the memory that it has on hand.
>> Which OS are we talking about? At least for Linux I know for a fact
>> that the Virtual Memory Manager is an OS-level service. I don't know
>> Windows internals but I'm willing to say that the virtual memory
>> manager exists as an OS-level service. That means the heap that's
>> available to an application would have to be allocated by a VMM before
>> the application even touches the CPU.
> There are two layers.  The application has a memory
> manager in the C library which services most requests.


> This allocator gets more memory from the OS as needed,
> but usually doesn't release it back to the OS immediately.

On Sun the implementation of malloc/free are like this. But on Linux
and other OSes as far as I can tell a call to free will actually tell
the system memory manager that the segment(s) or page(s) in question
are available.

> In ptmalloc2 which is used be glibc, the default settings
> work something like this:
> * If the requested size is at least 128 KB use mmap
>  to let the OS handle the request directly.


> * If there is a block available that's big enough, use it.


> * try to use sbrk to expand the heap.


> * If sbrk fails, use mmap to allocate a 1 MB chunk.

And there we go.

> Note that the OS never gets a mmap request less than
> 128 KB and all the memory allocated via sbrk is always
> contiguous.

Yes, and this is the problem with sbrk: if you rely on your allocator
to use sbrk, sbrk is not guaranteed to just expand the available
segment in place -- which means the VMM no matter what you do will
actually have to find the memory for you and give you that chunk of
memory. This means potentially swapping pages in/out. You limit this
likelihood by asking for page-sized and page-aligned chunks and using
those properly; making data in these chunks immutable makes for much
more sane "shareability" and a much more predictable performance

The problem is then compounded by when your program calls sbrk/brk in
one core and fails -- then mmap is called and the memory returned by
mmap will actually be in the memory module that's handled by a NUMA
CPU. Now imagine using these chunks from another thread that's not
running on the same core and you go downhill from there. That doesn't
even factor the cache hits/misses that can kill your performance
especially if the VMM has to page in/out memory that has been evicted
from the L3 cache of the processors (if there's even an L3).

All of this stems from the fact that you cannot rely on there being
enough memory at the start of the program at any given point. Thinking
about it differently helps this way: if you assume that every memory
allocation you make will cause a page fault and will potentially cause
the VMM to allocate you a page, then you should be using this page of
memory properly and saving on the amount of page faults you generate.
In the context of playing well with the OS VMM, this means being wise
about how you use your memory so that the VMM doesn't have to work too
hard when you request for a data segment of a given size. This works
for sbrk as well because asking for an extension in the amount of
memory available to your program as a "heap" will cause page faults
almost always.

>>> Humph.  9 KB is not all that much memory.
>>> Besides, how often do you have strings
>>> this big anyway?
>> When you're downloading HTML pages and/or building HTML pages, or
>> parsing HTTP requests, it happens a lot. 9kb is a number that's
>> suitable to show that you need multiple contiguous pages for any
>> meaningful discussion on what memory paging effects have something to
>> do with strings and algorithms that deal with strings. Of course this
>> assumes 4kb pages too, the formula would be the same though with just
>> values changing for any platform. ;)
> How big are the strings you're really dealing with?

There's really no way to know up front especially with HTTP traffic
that's coming in as chunk-encoded bytes. ;)

> 9 KB is just insignificant even in a 32 (31?)-bit virtual
> address space.

It is if you have 10,000 concurrent connections and you have to
allocate somewhere around that same amount of data for each
connection's "buffer".

>  As far as I'm concerned, it isn't big
> until you're talking megabytes, at least.

We're not just talking about one string -- I'm talking about several
tens of thousands of these potentially large strings.

> With 64-bit
> you can exhaust physical memory and use the entire
> hard drive for swap and still be an order of magnitude
> short of using up all the virtual address space, even
> when the hardware restricts you to 48-bit virtual addresses.

Yes, but the thing is, you don't have to get to that point for your
VMM to start swapping things in/out of memory. Especially on NUMA
architectures, how Linux behaves is pretty surprising, and that's
another can of worms you don't wanna get into. ;)

>> But the effect of finding the 5 contiguous pages is what I was trying
>> to highlight. Even if you could grow the original buffers of one of
>> the strings, in a std::string it would still have to be a contiguous
>> chunk of bytes anyway. :)
> The page size is simply irrelevant W.R.T.
> the effect of allocating contiguous memory.

It actually is. Note that there's a hardware page size and a virtual
page size. Bus bandwidths matter as well regarding how much data you
can actually pull from main memory into a CPU's cache. These factors
put a predictable upper bound to what you can do efficiently on a
given machine. Of course these are things you cannot control a priori
in your C++ application. This also is a matter of how well your OS
does virtual memory paging.

BUT... what you can control is how you ask for memory from the system.
This means if you can ask for a page-aligned and page-sized worth of
memory -- by knowing sufficiently well enough how big a "page" is at
runtime or a-priori -- then that's better memory usage all around
playing well with the VMM and your caches.

>> If we're going to go through "in my experience most programs" I'd say
>> this happens pretty often. ;)
> Well, looking at my system right now, the
> most memory that any process is using is
> 420 MB.  This is still a lot less than
> the 2 GB limit.  The next memory user
> isn't even close (180 MB) and after that
> it drops off rapidly.  If you're getting
> this close to running out of address space,
> contiguous strings are probably the least
> of your problems anyway.  There's no way
> the 5 contiguous pages make any noticeable
> difference whatsoever.

Right, so given this context, you're saying that using a sane memory
model wouldn't help?

Also when I meant "my experience", I was referring to really demanding
C++ applications that had to deal with thousands of transactions per
second. Even if these transactions just dealt with a few kb's worth of
data that fits in a page, having a thousand of these happening in one
second means having multiple thousand kb's potentially paged in/out in
one second. :D

It's in these kinds of applications where you would really want to
avoid the cost of unnecessarily paging data in/out at the VMM side of
the equation or having a sane allocator that knows how to play nicely
with a VMM.

>> The fact that contiguous pages don't get stored in contiguous physical
>> memory compounds the problem even more.
> Huh?

On NUMA machine with 4 memory controllers total sitting on 4 cores,
imagine a page table where 4 contiguous pages of virtual memory are
mapped across each of these 4 controllers evenly. Now if you're not
making mutations you're saving yourself the round-trip from the CPU's
write-through cache to the appropriate memory controller and
potentially invalidating cached pages on those controllers as well.
This is how NUMA changes the equation and how contiguous pages give
you the "illusion" of contiguity.

This is the hidden cost of mutability in these situations, and how
asking for contiguous virtual memory pages will kill your performance.

Try a simple benchmark: malloc a 1GB randomly generated file into a
64-bit multi-core Intel nehalem based processor, spawn 1,000 threads
on Linux and randomly on each thread change a value in a randomly
picked offset. This will show you how cache misses, VMM paging, and
the cost of mutability of data will actually kill your efficiency. Now
if you change the program to just randomly read from a part of memory,
you'll see how immutability helps. Then change it yet again to malloc
1GB worth of data but this time on multiple files breaking them files
up into smaller pieces. The dramatic improvement in performance should
be evident there. :)

>> You feel this a lot when your
>> system starts thrashing and the VMM does the work of swapping pages
>> and re-arranging page tables for you. The round-trips between a CPU
>> page fault and the VMM page fault handler along with potentially
>> invalidating L1/L2/L3 caches costs are significant in environments
>> where you need to get as much things done per second as physically
>> possible. If you can avoid having to go through this dance with a sane
>> memory management strategy, I think that's a win even in the trivial
>> case.
> I understand perfectly well that avoiding swapping
> helps performance.  What I still don't understand
> is how storing a string with contiguous virtual
> addresses increases the amount of swapping required.
> Unless you write your own allocator the odds are that
> splitting up the string into small chunks will
> spread it across /more/ pages, increasing your
> working set.

The paper does assume an allocator that knows how to use pages wisely.
Vicente Botet has a nice allocplus proposal which IMO is a good/better
way to go than the currently constrained allocators defined in C++03.
Maybe that will make it to C++0x, I haven't checked yet.

> I used to like the idea of segmented algorithms.
> However, I've more recently decided that they
> just make easy things hard and hard things impossible.
> The segmented implementation of for_each is bad enough.
> I don't want even to think about what any algorithm
> that couldn't easily be implemented in terms of for_each
> would look like.

Of course from an implementor's perspective this is true. But from the
user perspective there is absolutely 0 difference for the interface of
the segmented for_each algorithm. :) We are after all writing
libraries that developers (users) will use right? :D

>> Right. Now if you prevented this from even being possible by making a
>> string immutable, I'd say it's a win, no? :)
> I disagree.  Reading and writing the same data
> from multiple threads in parallel is a bad idea
> in general.

Sure, so how do you discourage that from being encouraged if all the
data structures you can share across threads are mutable?

> There's nothing special about strings.
> If you want to share immutable data, declare it
> const.  Problem solved.  No need to force all
> strings to be immutable just for this.

const has not prevented anybody from const_cast<...>'ing anything.
It's not sufficient that your immutability guarantee only comes from a
language feature that can even be overridden. And even if an object is
const, an implementor of that object's type has the mutable keyword
giving you a false sense of security in making calls even to const
member functions of that type.

Unfortunately as much as I would like to agree with the statement
"there's nothing special about strings", a large body of computer
science algorithms dealing with strings primarily (think compression
algorithms, distance algorithms, regex matching, parsing, etc.) show
otherwise. Strings are an appropriate abstraction for some kinds of
data and this is especially true if you're dealing with text-based
network protocols and user-generated data.


Dean Michael Berris

Boost list run by bdawes at, gregod at, cpdaniel at, john at