From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2001-11-20 17:22:58


dylan_at_[hidden] wrote:
> Has anyone tried implementing something to support huge containers
> that couldn't possibly fit into memory (even using swap space).

Interesting idea.

> I tried to think about what would be involved in implementing, say,
> std::vector so that you could specify that a certain amount would be
> cached in memory, while the rest was manually read and written to
> disk.

I think that on-disk data structures, partially cached in memory,
need to be different from STL containers --- certainly in their
implementation and their use, even if not in their interface.

A std::vector<> provides many guarantees, including constant-time
element access and the representability of &*it as a raw pointer,
which make it very difficult to implement a disk-based vector<>
with a custom allocator that does the swap-in magically.

> Ideally of course it should be possible by implementing just a
> custom allocator. Assuming that a vector's iterator ultimately uses
> allocator::pointer then if you write an allocator where pointer is
> actually a 'smart pointer' that swaps as needed, this would probably
> be 90% of the work.

What about the lifetime of the "raw pointer" returned by &*it?
It must be valid as long as the iterator is valid, which doesn't
work out with a disk-based std::vector<>.
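
To make the problem concrete, here is a toy sketch (the one-slot
cache and all the names are purely illustrative, not a proposed
design). A "smart pointer" that pages blocks through an in-memory
cache can keep the iterator valid, but a raw pointer obtained
through &*it silently goes stale as soon as the cache slot is
reused:

  #include <cstdio>

  // Stand-in for the cache behind a hypothetical disk allocator:
  // a single in-memory slot shared by all "blocks" on disk.
  struct one_slot_cache {
      int slot;          // the one in-memory copy of some block
      int loaded = -1;   // which block currently occupies the slot

      int* access(int block, int value_on_disk)
      {
          if (loaded != block) {      // cache miss: evict and reload
              slot = value_on_disk;   // pretend to read from disk
              loaded = block;
          }
          return &slot;               // raw pointer into the cache slot
      }
  };

  int main()
  {
      one_slot_cache cache;
      int* p = cache.access(0, 42);   // like &*it for element 0
      cache.access(1, 99);            // element 1 evicts block 0
      std::printf("%d\n", *p);        // prints 99: p went stale
  }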

There's a storage and cache hierarchy in modern computers (simplified)
 - CPU cache
 - RAM
 - disk

Access latency and storage size both increase from top to bottom.
In day-to-day programming, we don't care much about the difference
between CPU cache and RAM, which differ in latency only by about a
factor of 10. Data structures such as std::list or std::map that
traverse lots of pointers harm access locality and thus cache
efficiency, yet we can still use them to good advantage.
However, performance-sensitive people do try to arrange their
algorithms so that they make efficient use of the CPU cache.
That's why matrix packages don't use std::list.
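
The effect is easy to see even in ordinary in-memory code. In the
(illustrative, unbenchmarked) sketch below, the vector loop gets
several useful elements per cache line, while the list loop pays a
pointer chase per element:

  #include <list>
  #include <numeric>
  #include <vector>

  // Contiguous storage: sequential access is cache-friendly.
  long sum_vector(const std::vector<int>& v)
  {
      return std::accumulate(v.begin(), v.end(), 0L);
  }

  // Node-based storage: every element costs a pointer dereference
  // to a potentially distant heap location.
  long sum_list(const std::list<int>& l)
  {
      return std::accumulate(l.begin(), l.end(), 0L);
  }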

Also, the unit of addressing is the same for CPU cache and RAM:
a "raw" pointer. That's different for a disk: a "raw" pointer
may not be able to address all bytes on a disk, so we need
"extended" pointers for that, e.g. block number and offset or
a 64 bit linear offset.
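
For illustration (the field widths and the block size are my
assumptions, not a proposal), an "extended" pointer might look
like this:

  #include <cstdint>

  // Block number plus byte offset within the block.
  struct disk_ptr {
      std::uint32_t block;
      std::uint16_t offset;
  };

  // The alternative: one 64-bit linear byte offset into the store.
  using linear_disk_ptr = std::uint64_t;

  // Converting between the two, assuming 4 KiB blocks.
  constexpr std::uint32_t block_size = 4096;

  constexpr linear_disk_ptr to_linear(disk_ptr p)
  {
      return linear_disk_ptr(p.block) * block_size + p.offset;
  }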

Compared to RAM, a disk has a latency disadvantage of about a
factor of 100,000 (100 nsec vs. 10 msec). For all but the
most trivial applications, access locality is a must to
ensure acceptable performance. Most STL containers are
not appropriate. For example, on disk a B-Tree is much
better than a binary tree (which is what most std::map
implementations use), because a B-Tree is much shallower
than a binary tree holding the same number of elements.
Traversing a B-Tree therefore requires far fewer block
reads from the disk than traversing a binary tree.
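
A back-of-the-envelope calculation shows the size of the win.
Assume 10^9 elements and a B-Tree fan-out of 256 (a plausible fill
for 4 KiB blocks; both numbers are assumptions for illustration).
Each tree level costs roughly one block read:

  #include <cmath>
  #include <cstdio>

  int main()
  {
      const double n = 1e9;
      double binary_depth = std::log2(n);                  // ~30
      double btree_depth  = std::log(n) / std::log(256.0); // ~3.7

      // So a lookup needs ~30 block reads in a binary tree,
      // but only ~4 in the B-Tree.
      std::printf("binary: %.0f, B-Tree: %.0f\n",
                  binary_depth, std::ceil(btree_depth));
  }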

I'd also like to have an expanded interface for on-disk
data structures to control read-ahead, caching policy etc.

I would be glad to see interface suggestions for on-disk
data structures, based on a common "DiskAllocator"
(for lack of a better name). I'm curious what a "DiskAllocator"
interface that efficiently supports various on-disk containers
would look like.
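
To start the discussion, here is one rough shape such an interface
might take. Everything below (the names, the pin/unpin idea, the
hint enum) is invented for illustration only. The pin/unpin pair
replaces the implicit &*it of in-memory containers with an
explicitly bounded lifetime for raw pointers:

  #include <cstddef>
  #include <cstdint>

  struct disk_ptr {            // as sketched above
      std::uint32_t block;
      std::uint16_t offset;
  };

  enum class cache_hint { normal, sequential, random };

  template<typename T>
  class disk_allocator {
  public:
      using pointer = disk_ptr;            // "extended" pointer

      pointer allocate(std::size_t n);     // reserve n objects on disk
      void deallocate(pointer p, std::size_t n);

      // Pin a block in the cache; the raw pointer stays valid
      // only until the matching unpin().
      T* pin(pointer p);
      void unpin(pointer p);

      // Let containers steer read-ahead and caching policy.
      void advise(pointer p, std::size_t n, cache_hint h);
  };

A container built on top of this would pin a block, work inside
it, and unpin it again, and could advise() sequential access
before a linear scan to trigger read-ahead.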

Jens Maurer

