Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-28 13:22:03
On Sat, Jan 29, 2011 at 1:43 AM, Steven Watanabe <watanabesj_at_[hidden]> wrote:
> On 1/28/2011 12:49 AM, Dean Michael Berris wrote:
>> For multi-core set-ups where you have a NUMA architecture, having one
>> thread allocate memory that's in a given memory controller (in a given
>> core/CPU) that has to be a given size spanning multiple pages will
>> give your OS a hard time finding at least two contiguous memory pages
>> (especially when memory is a scarce resource). That's the virtual
>> memory manager at work and that's a performance killer on most modern
>> (and even not so modern) platforms.
> I'm not sure what this has to do with the OS.
> The memory manager exists in user space and
> only goes to the OS when it can't fulfill the
> request with the memory that it has on hand.
Which OS are we talking about? At least for Linux I know for a fact
that the Virtual Memory Manager is an OS-level service. I don't know
Windows internals but I'm willing to say that the virtual memory
manager exists as an OS-level service. That means the heap that's
available to an application would have to be allocated by a VMM before
the application even touches the CPU.
Did I get that part wrong?
>> This means, let's say you have a memory page that's 4kb long, and
>> you need a contiguous string that is say 9kb, then that means your OS
>> would have to find 3 contiguous memory pages to be able to fit a
>> single string. Say that again, a single string needing 3 *contiguous*
>> memory pages. If this allocation happens on one core and then the
>> process is migrated to a different core for whatever reason that
>> controls a different memory module, then just accessing that string
>> that's even in L3 cache would be a problem.
> Humph. 9 KB is not all that much memory.
> Besides, how often do you have strings
> this big anyway?
When you're downloading HTML pages and/or building HTML pages, or
parsing HTTP requests, it happens a lot. 9kb is a number that's
suitable to show that you need multiple contiguous pages, which is the
point at which memory paging effects start to matter for strings and
the algorithms that deal with them. Of course this assumes 4kb pages
too; the formula stays the same on any platform, only the values
change. ;)
>> Then when you change anything in the pages, your write-through cache
>> will see it, and the OS virtual memory manager will mark that page
>> dirty. Next time you need to access memory in the same page in a
>> different core and especially if the data isn't in L3 cache yet, OS
>> sees a dirty page and does all page management necessary to mark that
>> page clean just for reading the data.
>> Now that's fine if the string doesn't grow. But now let's say you
>> concatenate a 4kb string to a 9kb string -- you now need 4
>> *contiguous* pages to fit all 13kb of data in a single string. That's
>> on top of the storage that's already required by either string.
> So, what you're saying is that requiring contiguous
> strings interferes with sharing memory between strings
> and thus increases memory usage. That I can understand,
> but you already covered this in a separate bullet point.
But the effect of finding the 4 contiguous pages is what I was trying
to highlight. Even if you could grow the original buffer of one of
the strings, in a std::string it would still have to be a contiguous
chunk of bytes anyway. :)
>> all hell breaks loose as your VMM will have to find these 4 pages that
>> sit right next to each other, potentially paging out stuff just so
>> that it can satisfy this simple concatenation request.
> In my experience most programs don't come close
> enough to exhausting their address spaces to
> make this a big deal. I don't believe that this
> is as apocalyptic as you're making it out in
If we're going to go through "in my experience most programs" I'd say
this happens pretty often. ;)
Of course I'm talking about network services and basically having to
deal with potentially huge data structures in memory along with text
that has to be generated and streamed out to network connections. And
this usually has to be done at pretty crazy rates (think several
thousands of requests per second).
This doesn't even consider the case where you not only have to build
strings that are potentially huge but also competing with mmap'ed
files in the same address space that span multiple contiguous memory
pages as well. :)
>> The time it
>> takes to get something like that done is just unacceptable whenever
>> you need to do something remotely interesting on a system that should
>> be handling thousands of transactions per second -- or even just on a
>> user's machine where you have absolutely no idea what other programs
>> are running and what kind of architecture they have. :)
> You need to be more clear about whether you're
> talking about physical addresses or virtual
> addresses. Contiguous pages do not have to
> be stored in contiguous physical memory.
The fact that contiguous pages don't get stored in contiguous physical
memory compounds the problem even more. You feel this a lot when your
system starts thrashing and the VMM does the work of swapping pages
and re-arranging page tables for you. The round-trips between a CPU
page fault and the VMM's page fault handler, along with potentially
invalidating the L1/L2/L3 caches, are significant costs in environments
where you need to get as many things done per second as physically
possible. If you can avoid this dance with a sane memory management
strategy, I think that's a win even in the trivial case.
> If what you're saying is true then std::deque
> should be preferable to std::vector, but this
> is exactly the opposite of what I've always
It depends on what you're trying to do and what algorithms you're
applying to your std::deque. This is the whole crux of the segmented
iterators debate/paper by Matt Austern and others. Segmented versions
of the algorithms would make std::deque a lot more preferable
if you really needed a data structure that doesn't promote heap
fragmentation as much as a growing std::vector does.
>> In the case where you have two strings trying to access the same
I should have said "two threads trying"...
>> string, and the string changes on a process that's working on one
>> processor, the processor has to keep the caches of both up-to-date
>> especially if it's in L1 cache. That means if you're building a string
>> in one thread and simultaneously reading it from another, you're
>> stressing the cache coherence mechanisms used by the multi-core
> This sounds like a bad idea to begin with, never
> mind the performance implications.
Right. Now if you prevented this from even being possible by making a
string immutable, I'd say it's a win, no? :)
>> On single-core machines that's largely not too much of a
>> problem unless you run into non-local memory access and have to swap
>> things in and out of the cache -- the case for when you have large
>> contiguous chunks of memory that have to be accessed randomly (like
>> strings and vectors).
> Sure, but this happens regardless of whether the string
> is stored contiguously or not. In fact, I would think
> that your proposal would make this worse by allowing
> hidden sharing between apparently unrelated strings.
The hidden sharing will only be a problem if you're allowing mutation.
Otherwise it's a good thing from a VMM/Cache perspective all-around
(except for the cache line that has a reference count, but that's
largely only touched when copies/temporaries get created/destroyed,
and in sane systems would be an atomic increment/decrement anyway).
> In Christ,
> Steven Watanabe
--
Dean Michael Berris
about.me/deanberris
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk