Boost :

From: Jody Hagins (jody-boost-011304_at_[hidden])
Date: 2004-04-16 01:14:10


On Fri, 16 Apr 2004 08:16:44 +0200
John Torjo <john.lists_at_[hidden]> wrote:

> But don't you close a file after you've used it? You mean to tell me
> you want to actually keep all 10000 files open at all times? That
> seems a little extreme, or crazy or something ;)

I've been called worse ;-> I really appreciate your input. A little
more rationale is at the end, specific to how I use this library.

> If you need something like this (which I think would happen very
> rarely), you can simply have a cache which opens FILEs on demand. You
> just request a read or write with a file name. Internally, if that
> file name is open, do read or write. Otherwise, open it and do the
> same. This should be a very simple class - no more than 100 lines of
> code.

I must not have explained well earlier, because that's what I was trying
to describe. Actually, that is what this library does (and since I wrote
it, I have used it quite a bit). However, it is considerably more than 100
lines. Granted, a large part of the code wraps the handles and normal
stdio functions so applications can safely use almost all the normal
stdio functions exactly the same with these cached file pointers. In
addition, I suppose my coding, like my writing, is a bit on the verbose
side.

So, what do I use it for?

I use this library for many things now, but the original need was
post-processing a large amount of US stock market information. I have a
stream of data that gets processed at the end of each day. The data
contains lots of information about each specific stock (e.g. quote and
trade information). However, there are more than 10,000 different
symbols in this file. The post processing splits information for each
symbol up into a separate file, one for each symbol. The vast majority
of the information pertains to a smallish subset of the symbols (a few
hundred). I have found it much easier to handle this information like
so:

fwrite(buffer, recsz, nrecs, symbol_info[sym]<file_ptr>.get());

SIDE NOTE: symbol_info is a std::map<symbol, wjh::dynamic_tuple>. A
dynamic_tuple is kinda like a boost::tuple, except you can add members
by type at run time (you still get compile time type checking though),
and you can access them through named type tags instead of just
integrals. The call to symbol_info[sym]<file_ptr>.get() returns a
reference to an object of type wjh::stdio::cached_file, and the proper
overload of fwrite() is called.

So I keep the "cached" file handle as an attribute of the symbol. The
first time that symbol is seen, the file is opened, and the handle is
put into the dynamic_tuple, associated with the type tag file_ptr.
Thus, anytime I want to write to the file associated with that symbol, I
simply do so. The file will be automatically reopened (with proper
mode, and file pointer repositioned) if it has been swapped out to
accommodate other accesses.
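
To make the swap-out and reopen behaviour concrete, here is a minimal
sketch of the idea. It is NOT the wjh::stdio::cached_file code; the names
file_cache, acquire, and evict are hypothetical, and least-recently-used
eviction is an assumption made for illustration. When the limit on open
files is reached, the oldest handle has its position saved and is closed;
the next access reopens it and seeks back to where it left off.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <list>
#include <map>
#include <string>

// Hypothetical illustration only; not the wjh::stdio interface.
class file_cache {
public:
    explicit file_cache(std::size_t max_open) : max_open_(max_open) {
        assert(max_open_ >= 1);  // at least one real FILE* must be allowed
    }
    ~file_cache() {
        for (auto& kv : entries_)
            if (kv.second.fp) std::fclose(kv.second.fp);
    }
    file_cache(const file_cache&) = delete;
    file_cache& operator=(const file_cache&) = delete;

    // Write records to the named file, opening or reopening it on demand.
    std::size_t write(const std::string& path, const void* buf,
                      std::size_t recsz, std::size_t nrecs) {
        std::FILE* fp = acquire(path);
        return fp ? std::fwrite(buf, recsz, nrecs, fp) : 0;
    }

    std::size_t open_count() const { return lru_.size(); }
    std::size_t reopen_count() const { return reopens_; }

private:
    struct entry {
        std::FILE* fp = nullptr;   // null while swapped out
        long pos = 0;              // file position saved at swap-out
        bool seen = false;         // true once the file has been created
        std::list<std::string>::iterator lru_it;
    };

    std::FILE* acquire(const std::string& path) {
        entry& e = entries_[path];
        if (e.fp) {  // cache hit: move to the front of the LRU list
            lru_.splice(lru_.begin(), lru_, e.lru_it);
            return e.fp;
        }
        if (lru_.size() >= max_open_) evict();
        // First open creates the file; a reopen must not truncate it.
        e.fp = std::fopen(path.c_str(), e.seen ? "r+b" : "wb");
        if (!e.fp) return nullptr;
        if (e.seen) {
            std::fseek(e.fp, e.pos, SEEK_SET);  // restore saved position
            ++reopens_;
        }
        e.seen = true;
        lru_.push_front(path);
        e.lru_it = lru_.begin();
        return e.fp;
    }

    // Swap out the least recently used file, remembering its position.
    void evict() {
        entry& victim = entries_[lru_.back()];
        victim.pos = std::ftell(victim.fp);
        std::fclose(victim.fp);
        victim.fp = nullptr;
        lru_.pop_back();
    }

    std::size_t max_open_;
    std::size_t reopens_ = 0;
    std::map<std::string, entry> entries_;
    std::list<std::string> lru_;  // front = most recently used
};
```

With something like this, a call such as cache.write(name, buffer, recsz,
nrecs) looks the same whether the file is currently open or has been
swapped out. A real implementation would also need to remember the
original open mode and handle reads, which this sketch leaves out.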

I find it nice to use virtual FILE pointers, so I do not have to worry
about running out. In practice, for my apps, I do not experience a
terrible amount of swapping (relative to the number of positive cache
hits).

However, your point is well taken, and while I have many uses for this
library, others may not (unless they need to use lots of files, or use an
OS with very limiting restrictions on the number of open files).

Thanks!!!

-- 
Jody Hagins

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk