Subject: Re: [boost] [string] proposal
From: Steven Watanabe (watanabesj_at_[hidden])
Date: 2011-01-28 17:53:22
AMDG
On 1/28/2011 10:22 AM, Dean Michael Berris wrote:
> On Sat, Jan 29, 2011 at 1:43 AM, Steven Watanabe<watanabesj_at_[hidden]> wrote:
>> I'm not sure what this has to do with the OS.
>> The memory manager exists in user space and
>> only goes to the OS when it can't fulfill the
>> request with the memory that it has on hand.
>>
>
> Which OS are we talking about? At least for Linux I know for a fact
> that the Virtual Memory Manager is an OS-level service. I don't know
> Windows internals but I'm willing to say that the virtual memory
> manager exists as an OS-level service. That means the heap that's
> available to an application would have to be allocated by a VMM before
> the application even touches the CPU.
>
There are two layers. The application has a memory
manager in the C library, which services most requests.
This allocator gets more memory from the OS as needed,
but usually doesn't release it back to the OS immediately.
In ptmalloc2, which is used by glibc, the default settings
work something like this:
* If the requested size is at least 128 KB, use mmap
to let the OS handle the request directly.
* If there is a free block available that's big enough, use it.
* Otherwise, try to use sbrk to expand the heap.
* If sbrk fails, use mmap to allocate a 1 MB chunk.
Note that the OS never sees an mmap request smaller than
128 KB, and all the memory allocated via sbrk is contiguous.
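To make that concrete, here's a minimal glibc-specific sketch.
mallopt and M_MMAP_THRESHOLD are the real knobs; the comments
describe the default behavior only roughly (newer glibc versions
also adjust the threshold dynamically):

    #include <malloc.h>  // glibc: mallopt, M_MMAP_THRESHOLD
    #include <cstdlib>

    int main()
    {
        // Make the default explicit: 128 KB is glibc's usual threshold.
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);

        void* small = std::malloc(64 * 1024);  // served from the heap
        void* large = std::malloc(256 * 1024); // handed straight to mmap

        std::free(small); // goes back to the allocator's free lists
        std::free(large); // munmap'ed, returned to the OS immediately
        return 0;
    }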
>>> This means, let's say that you have a memory page that's 4 KB long and
>>> you need a contiguous string that is, say, 9 KB; that means your OS
>>> would have to find 3 contiguous memory pages to be able to fit a
>>> single string. Say that again: a single string needing 3 *contiguous*
>>> memory pages. If this allocation happens on one core, and the
>>> process is then migrated for whatever reason to a different core that
>>> controls a different memory module, then just accessing that string,
>>> even one that's in the L3 cache, would be a problem.
>>>
>>
>> Humph. 9 KB is not all that much memory.
>> Besides, how often do you have strings
>> this big anyway?
>>
>
> When you're downloading HTML pages and/or building HTML pages, or
> parsing HTTP requests, it happens a lot. 9 KB is a number that's
> suitable to show that you need multiple contiguous pages, which is
> what makes memory-paging effects relevant to strings and the
> algorithms that deal with them. Of course this assumes 4 KB pages
> too; the formula would be the same for any platform, with just the
> values changing. ;)
>
How big are the strings you're really dealing with?
9 KB is just insignificant even in a 32 (31?)-bit virtual
address space. As far as I'm concerned, it isn't big
until you're talking megabytes, at least. With 64-bit
you can exhaust physical memory and use the entire
hard drive for swap and still be an order of magnitude
short of using up all the virtual address space, even
when the hardware restricts you to 48-bit virtual addresses.
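As a back-of-envelope, in code (the RAM and swap sizes here are
just illustrative assumptions, not anyone's actual machine):

    // 48-bit virtual address space vs. everything you could swap to.
    constexpr unsigned long long kVirtual = 1ULL << 48;  // 256 TiB
    constexpr unsigned long long kRam     = 64ULL << 30; // 64 GiB of RAM
    constexpr unsigned long long kSwap    = 20ULL << 40; // a 20 TiB disk
    static_assert(kVirtual / (kRam + kSwap) >= 10,
                  "still an order of magnitude of headroom");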
> But the effect of finding the 5 contiguous pages is what I was trying
> to highlight. Even if you could grow the original buffers of one of
> the strings, in a std::string it would still have to be a contiguous
> chunk of bytes anyway. :)
>
The page size is simply irrelevant with respect
to the effect of allocating contiguous memory.
>>> Then
>>> all hell breaks loose as your VMM will have to find these 5 pages that
>>> sit right next to each other, potentially paging out stuff just so
>>> that it can satisfy this simple concatenation request.
>>
>> In my experience most programs don't come close
>> enough to exhausting their address spaces to
>> make this a big deal. I don't believe that this
>> is as apocalyptic as you're making it out
>> to be, in general.
>>
>
> If we're going to go through "in my experience most programs" I'd say
> this happens pretty often. ;)
>
Well, looking at my system right now, the
most memory that any process is using is
420 MB. That's still a lot less than
the 2 GB limit. The next biggest memory
user isn't even close (180 MB), and after
that it drops off rapidly. If you're getting
this close to running out of address space,
contiguous strings are probably the least
of your problems anyway. There's no way
the 5 contiguous pages make any noticeable
difference whatsoever.
>>
>> You need to be more clear about whether you're
>> talking about physical addresses or virtual
>> addresses. Contiguous pages do not have to
>> be stored in contiguous physical memory.
>>
>
> The fact that contiguous pages don't get stored in contiguous physical
> memory compounds the problem even more.
Huh?
> You feel this a lot when your
> system starts thrashing and the VMM does the work of swapping pages
> and re-arranging page tables for you. The round trips between a CPU
> page fault and the VMM's page-fault handler, along with potentially
> invalidating the L1/L2/L3 caches, are costly in environments
> where you need to get as many things done per second as physically
> possible. If you can avoid having to go through this dance with a
> sane memory management strategy, I think that's a win even in the
> trivial case.
>
I understand perfectly well that avoiding swapping
helps performance. What I still don't understand
is how storing a string with contiguous virtual
addresses increases the amount of swapping required.
Unless you write your own allocator, the odds are that
splitting up the string into small chunks will
spread it across /more/ pages, increasing your
working set.
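Rough page arithmetic, assuming 4 KB pages (the 32-chunk split
below is a hypothetical layout, not any particular proposal):

    #include <cstddef>

    constexpr std::size_t kPage = 4 * 1024;
    constexpr std::size_t kLen  = 9 * 1024; // the 9 KB string from before

    // Contiguous storage: ceil(9 / 4) = 3 pages in the working set.
    constexpr std::size_t contiguous_pages = (kLen + kPage - 1) / kPage;

    // Split into 32 separately allocated 288-byte chunks: in the worst
    // case each chunk lands on its own page, so the working set can
    // grow to 32 pages for the same 9 KB of data.
    constexpr std::size_t worst_case_chunk_pages = 32;

    static_assert(contiguous_pages == 3, "3 pages when contiguous");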
>> If what you're saying is true then std::deque
>> should be preferable to std::vector, but this
>> is exactly the opposite of what I've always
>> heard.
>>
>
> It depends on what you're trying to do and what algorithms you're
> applying to your std::deque. This is the whole crux of the segmented
> iterators debate/paper by Matt Austern and others. Segmented versions
> of the algorithms would make using std::deque a lot more preferable
> if you really needed a data structure that doesn't promote heap
> fragmentation as much as a growing std::vector does.
>
I used to like the idea of segmented algorithms.
However, I've more recently decided that they
just make easy things hard and hard things impossible.
The segmented implementation of for_each is bad enough.
I don't even want to think about what any algorithm
that couldn't easily be implemented in terms of for_each
would look like.
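For the record, here's roughly the shape of it. The segment()/local()
accessors below are stand-ins for the segmented-iterator traits in
Austern's paper, not a real API; the point is the three-way split
that even for_each needs:

    #include <algorithm>

    template <class SegIt, class F>
    F seg_for_each(SegIt first, SegIt last, F f)
    {
        auto sf = first.segment(), sl = last.segment();
        if (sf == sl) // both ends in one segment: a single flat loop
            return std::for_each(first.local(), last.local(), f);
        f = std::for_each(first.local(), sf->end(), f);     // partial head
        for (++sf; sf != sl; ++sf)                          // full middles
            f = std::for_each(sf->begin(), sf->end(), f);
        return std::for_each(sl->begin(), last.local(), f); // partial tail
    }

Every other algorithm gets its own version of that case analysis,
which is what makes the hard ones so unpleasant.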
>>> string, and the string changes on a process that's working on one
>>> processor, the processor has to keep the caches of both up-to-date
>>> especially if it's in the L1 cache. That means if you're building a string
>>> in one thread and simultaneously reading it from another, you're
>>> stressing the cache-coherence mechanisms used by these multi-core
>>> machines.
>>
>> This sounds like a bad idea to begin with, never
>> mind the performance implications.
>>
>
> Right. Now if you prevented this from even being possible by making a
> string immutable, I'd say it's a win, no? :)
>
I disagree. Reading and writing the same data
from multiple threads in parallel is a bad idea
in general. There's nothing special about strings.
If you want to share immutable data, declare it
const. Problem solved. No need to force all
strings to be immutable just for this.
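In code, the whole of it (nothing here is special to any proposal;
it's just const and a pointer-to-const):

    #include <string>
    #include <boost/shared_ptr.hpp>

    const std::string greeting("hello"); // no one can mutate this

    // Sharing read-only data across threads: share a pointer-to-const.
    // Concurrent reads through it are fine; writes don't compile.
    boost::shared_ptr<const std::string> shared(
        new std::string("read-only everywhere"));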
>> In fact, I would think
>> that your proposal would make this worse by allowing
>> hidden sharing between apparently unrelated strings.
>>
>
> The hidden sharing will only be a problem if you're allowing mutation.
> Otherwise it's a good thing from a VMM/Cache perspective all-around
> (except for the cache line that has a reference count, but that's
> largely only touched when copies/temporaries get created/destroyed,
> and in sane systems would be an atomic increment/decrement anyway).
>
Okay.
In Christ,
Steven Watanabe