From: João Abecasis (jpabecasis_at_[hidden])
Date: 2005-09-02 13:51:56
Alexey Pakhunov wrote:
> Joao Abecasis wrote:
>>>1. Loading a >100MB file into memory is not a good idea;
>>We could add a default 1MB (?) limit and change the signature to,
>> rule CAT ( file : max_bytes ? )
> Sooner or later we'll need to extend this limit. It is even questionable
> to have such a limit as 2GB or 4GB. Of course CAT'ing 4GB file looks
> strange but imagine the following scenario: a user wants to detect some
> signature at the end of a big file. 'CAT' may allow it if an offset can
> be passed. I.e.:
> rule CAT ( file : offset ? : bytes ? ) ;
Adding bytes and offset sounds very reasonable to me. The issue is how
to combine size and offset into the hash in the current implementation.
Maybe such a rule would better fit a MMAP_FILE builtin:
rule MMAP_FILE ( file : offset = 0 ? : bytes = 1 MB ? )
, instead of a CAT with a file size limit. The file size limit should be
looked at as a protection mechanism, nothing more.
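To make the offset + bytes semantics concrete, here's a minimal C sketch of how such a builtin might read a byte range with a default size cap. The names (read_range, DEFAULT_LIMIT) are illustrative, not actual bjam identifiers, and this uses plain stdio, so it inherits the 2GB issue discussed below:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_LIMIT (1024 * 1024)  /* 1 MB default cap, as proposed */

/* Read up to `bytes` bytes starting at `offset`; returns a malloc'd
   buffer (caller frees) and stores the count actually read in *out_len.
   Returns NULL on any failure, modeling "allocation fails -> empty list". */
static char *read_range(const char *path, long offset, size_t bytes,
                        size_t *out_len)
{
    FILE *f = fopen(path, "rb");
    char *buf;

    if (!f) return NULL;
    if (fseek(f, offset, SEEK_SET) != 0) { fclose(f); return NULL; }

    /* Treat the limit as a protection mechanism, nothing more. */
    if (bytes == 0 || bytes > DEFAULT_LIMIT) bytes = DEFAULT_LIMIT;

    buf = malloc(bytes);
    if (!buf) { fclose(f); return NULL; }

    *out_len = fread(buf, 1, bytes, f);
    fclose(f);
    return buf;
}
```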
Large file support is another problem altogether -- AFAIK stdio in C has
no support for large files (>2GB) out-of-the-box. Adding something like
this to bjam requires a portability/abstraction layer on top of
platform-specific implementations (then again, it might just be POSIX +
Windows...). I'd rather stay off those grounds for now.
IMO, large file support should be done jam-wide and independently of
CAT. At this point it's not even clear to me that some CAT builtin will
be added at all.
>>Well, I assumed allocation would fail and an empty list would be returned.
> This is potentially a big hole. What if the file size is 0x1000000001?
> The value will be truncated to 0x1.
Hmm... I don't think that's how it works. IIUC, with stdio I get to see
up to the first 2GB (?) of a file. So the filesize I can determine with
fseek + ftell is never greater than that. (Note: there could be other
issues from my use of long -> size_t conversions, which I believe I have
fixed in my local copy).
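To illustrate the point about fseek + ftell: with plain stdio, ftell returns a long, so on platforms where long is 32 bits the reported size simply cannot exceed LONG_MAX (~2GB); ftell fails with -1 rather than silently wrapping a huge size down to a tiny value. A sketch of the size check under discussion (file_size is an illustrative name, not bjam code):

```c
#include <stdio.h>

/* Determine a file's size via fseek + ftell.
   Returns -1 on failure (unreadable file, or size not
   representable in a long on this platform). */
static long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    long size = -1;

    if (!f) return -1;
    if (fseek(f, 0L, SEEK_END) == 0)
        size = ftell(f);  /* -1 if the offset overflows a long */
    fclose(f);
    return size;
}
```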
>>Of course another question entirely is if we should care about files
>>that large. I didn't notice support for large files (>2GB) elsewhere in
>>bjam (of course I may have overlooked it).
> I don't know either if bjam supports >2GB files. But if it doesn't then
> we have to add it step by step.
Sure, right now my quest is CAT and possibly GREP. I'm not against
adding large file support. I'd be willing to use a portability layer
someone else contributes ;-)
Again, I think this is independent of CAT.
>>Do you think adding a default or even a hard-coded limit for the number
>>of bytes read would fix these issues or are you suggesting that the
>>approach is flawed from the beginning?
> I think the limit will not solve all problems. I think some kind of
> streaming support should be implemented instead. For example each time
> 'CAT' is called it will read only a single block, a single line, or a block of lines.
I also thought of implementing a grep-like rule that'd use streaming and
avoid mapping entire files to memory:
rule GREP ( regexp : files * : recursive ? )
> Other features, I guess, can be useful:
> - Passing an offset and block size to read;
> - Support of negative offsets - to be able to read from the tail of a file;
> - Support of line-by-line reading.
and I'll also add
- querying the size of a file (so we can decide whether and how to CAT).
(it'd be worthwhile to return "this is a large file" ;-)
I think line-by-line reading is ideal for something like a grep command
where you can inspect the lines and discard them afterwards. But to put
stuff up in memory I thought it'd be better to have it all in one place.
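The streaming idea can be sketched in C with a plain fgets loop: memory use is bounded by one line, not the file size. This uses a substring match as a stand-in for the regexp matching a real GREP builtin would do, and grep_count / MAX_LINE are illustrative names:

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 4096  /* per-line buffer; only this much is ever held */

/* Count the lines of `path` containing `needle`, reading the file
   one line at a time. Returns -1 if the file cannot be opened. */
static int grep_count(const char *path, const char *needle)
{
    FILE *f = fopen(path, "rb");
    char line[MAX_LINE];
    int hits = 0;

    if (!f) return -1;
    while (fgets(line, sizeof line, f))  /* inspect, then discard */
        if (strstr(line, needle))
            ++hits;
    fclose(f);
    return hits;
}
```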
Boost-Build list run by bdawes at acm.org, david.abrahams at rcn.com, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk