
Boost-Build :

From: Joao Abecasis (jpabecasis_at_[hidden])
Date: 2005-09-02 00:28:39


Hi!

Rene Rivera wrote:
> Of course a contribution of a "CAT" builtin would be ideal ;-) (I might
> do this myself next week if no one beats me to it.)

I'm attaching an implementation for a "CAT" builtin.

(Let me actually attach the files before I forget... done!)

Attached you can find a diff to CVS jam sources, where the CAT builtin
is defined, and two support files implementing memory mapped files in std C.

I tried searching the archives for CAT and builtin CAT, but didn't find
much. Admittedly, this was due to the particular search strings and
search engines I tried... So feel free to point out anything, from
fundamental flaws to minor glitches, in the approach and implementation.

The added rule is:

rule CAT ( file )

CAT takes a single argument, a file name, and returns a string
containing the entire *binary* contents of the file. If the argument names
a directory or anything other than a file, an empty list is returned.
Alternatively, more than one file could be specified, and CAT could be
recursive when applied to directories.

(Some implementation considerations follow)

Because strings are immutable in jam, entire files are copied to memory
by the CAT builtin. For this, I looked at newstr but finally decided to
roll my own allocation scheme. Using newstr would require fiddling
with its implementation and could cost extra allocations.
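
A minimal sketch of the single-allocation idea, assuming a hypothetical
struct file_record and helper file_record_new (neither is in the patch).
It uses a C99 flexible array member, whereas the attached map_file.c gets
the same layout with a one-char placeholder and offsetof:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: the name and the data share one heap block. */
struct file_record {
    size_t size;      /* number of data bytes, excluding the terminator */
    char  *name;      /* points into storage[] */
    char  *data;      /* points into storage[], right after the name */
    char   storage[]; /* C99 flexible array member */
};

/* Allocate room for a copy of name plus size data bytes in one malloc.
 * The data bytes are left uninitialized; only the terminator is set.
 * Returns NULL if the allocation fails. */
struct file_record *file_record_new(const char *name, size_t size)
{
    size_t name_len = strlen(name) + 1;
    struct file_record *r = malloc(sizeof *r + name_len + size + 1);

    if (!r)
        return NULL;

    r->size = size;
    r->name = r->storage;
    r->data = r->storage + name_len;
    memcpy(r->name, name, name_len);
    r->data[size] = '\0';
    return r;
}
```

One call to free() on the returned pointer releases the name and the data
together, which is what keeps the per-file allocation count at one.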

In the end I implemented memory mapped files anew (actually, memory
copied files), using stdio functions: fseek [1], ftell [1], rewind,
fread. In the future, platform-specific memory-mapped files could be used.
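
A minimal sketch of the copy-to-heap approach with those four stdio calls,
assuming a hypothetical helper named read_whole_file (it is not part of
the patch, which keeps its copies in a block list instead):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: copy a whole file to the heap.
 * Returns a NUL-terminated buffer the caller must free(), or NULL on
 * error. On success *size_out holds the number of bytes read. */
char *read_whole_file(const char *path, size_t *size_out)
{
    char *buf = NULL;
    FILE *fp = fopen(path, "rb");

    if (!fp)
        return NULL;

    if (fseek(fp, 0L, SEEK_END) == 0) {
        long size = ftell(fp);              /* -1L on failure */

        if (size >= 0) {
            rewind(fp);
            buf = malloc((size_t)size + 1);
            if (buf) {
                size_t got = fread(buf, 1, (size_t)size, fp);
                buf[got] = '\0';            /* terminate after the bytes read */
                *size_out = got;
            }
        }
    }

    fclose(fp);
    return buf;
}
```

The C standard does not strictly require binary streams to support
SEEK_END, so this size query is a common practice rather than a guarantee.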

As with newstr, a hash table is used to avoid having multiple in-memory
copies of the files. Assuming it should generally be faster, the filename
(not the contents) is used as the key. Currently, however, no attempt is
made at canonicalizing filenames.
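
To illustrate what the missing canonicalization means, here is a small
sketch of a cache keyed by the filename string. It assumes hypothetical
names (cache_entry, cache_lookup, cache_store) and uses a linear list
rather than jam's hash.c, purely to keep the example self-contained:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry; the patch itself uses jam's hash.c instead. */
struct cache_entry {
    struct cache_entry *next;
    char *name;   /* lookup key: the filename exactly as given */
    char *data;   /* cached contents */
};

static struct cache_entry *cache_head = NULL;

/* Return the cached contents for name, or NULL if name was never stored. */
const char *cache_lookup(const char *name)
{
    struct cache_entry *e;

    for (e = cache_head; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->data;
    return NULL;
}

/* Store private copies of name and data. Returns 1 on success, 0 on failure. */
int cache_store(const char *name, const char *data)
{
    size_t name_len = strlen(name) + 1;
    size_t data_len = strlen(data) + 1;
    struct cache_entry *e = malloc(sizeof *e);

    if (!e)
        return 0;

    e->name = malloc(name_len);
    e->data = malloc(data_len);
    if (!e->name || !e->data) {
        free(e->name);
        free(e->data);
        free(e);
        return 0;
    }

    memcpy(e->name, name, name_len);
    memcpy(e->data, data, data_len);
    e->next = cache_head;
    cache_head = e;
    return 1;
}
```

After cache_store("a.txt", ...), a lookup of "./a.txt" misses even though
both strings name the same file, so such a file would be read twice.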

Because, AFAIK, in jam strings are immutable and not tracked, every
CAT'ed file must be kept in memory for the lifetime of bjam. This may be
suboptimal if one is reading lots of files and discarding the contents.
OTOH, once they're mapped one may do whatever on the files.

Now that this is done I wonder if it would be better to have a GREP
builtin, either instead of CAT or in addition to it. A GREP builtin
would forgo mapping files to memory, reducing per-file overhead. Also,
for the use case I suggested initially I could do well with a GREP
instead of a CAT.
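
A rough sketch of why a GREP could avoid the per-file copy: it only needs
one line buffer at a time. The helper name count_matching_lines is
hypothetical, and it matches a literal substring with strstr, whereas a
real builtin would presumably use jam's regexp support:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count the lines of path that contain the literal
 * substring needle. Returns -1 if the file cannot be opened.
 * Lines longer than the buffer are examined in pieces, so they may be
 * counted more than once or a match spanning two pieces may be missed. */
long count_matching_lines(const char *path, const char *needle)
{
    char line[4096];
    long matches = 0;
    FILE *fp = fopen(path, "r");

    if (!fp)
        return -1;

    while (fgets(line, sizeof line, fp))
        if (strstr(line, needle))
            ++matches;

    fclose(fp);
    return matches;
}
```

Memory use here is bounded by the line buffer regardless of file size,
and nothing has to be kept alive after the call returns.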

Thoughts? Comments?

João Abecasis

[1] fseek and ftell were not previously used in jam sources.
[Attachment: jam_cat.patch]

? map_file.c
? map_file.h
Index: build.bat
===================================================================
RCS file: /cvsroot/boost/boost/tools/build/jam_src/build.bat,v
retrieving revision 1.32
diff -a -u -r1.32 build.bat
--- build.bat 1 Aug 2005 13:39:46 -0000 1.32
+++ build.bat 2 Sep 2005 04:23:25 -0000
@@ -314,7 +314,7 @@
set BJAM_SOURCES=
set BJAM_SOURCES=%BJAM_SOURCES% command.c compile.c execnt.c expand.c filent.c glob.c hash.c
set BJAM_SOURCES=%BJAM_SOURCES% hdrmacro.c headers.c jam.c jambase.c jamgram.c lists.c make.c make1.c
-set BJAM_SOURCES=%BJAM_SOURCES% newstr.c option.c parse.c pathunix.c regexp.c
+set BJAM_SOURCES=%BJAM_SOURCES% map_file.c newstr.c option.c parse.c pathunix.c regexp.c
set BJAM_SOURCES=%BJAM_SOURCES% rules.c scan.c search.c subst.c timestamp.c variable.c modules.c
set BJAM_SOURCES=%BJAM_SOURCES% strings.c filesys.c builtins.c pwd.c class.c w32_getreg.c native.c
set BJAM_SOURCES=%BJAM_SOURCES% modules/set.c modules/path.c modules/regex.c
Index: build.jam
===================================================================
RCS file: /cvsroot/boost/boost/tools/build/jam_src/build.jam,v
retrieving revision 1.67
diff -a -u -r1.67 build.jam
--- build.jam 19 Jun 2005 19:39:53 -0000 1.67
+++ build.jam 2 Sep 2005 04:23:25 -0000
@@ -280,7 +280,7 @@
command.c compile.c expand.c glob.c
hash.c hcache.c headers.c hdrmacro.c
jam.c jambase.c jamgram.c
- lists.c make.c make1.c newstr.c
+ lists.c make.c make1.c map_file.c newstr.c
option.c parse.c regexp.c rules.c
scan.c search.c subst.c
timestamp.c variable.c modules.c strings.c filesys.c
Index: build.sh
===================================================================
RCS file: /cvsroot/boost/boost/tools/build/jam_src/build.sh,v
retrieving revision 1.35
diff -a -u -r1.35 build.sh
--- build.sh 1 Aug 2005 14:14:43 -0000 1.35
+++ build.sh 2 Sep 2005 04:23:25 -0000
@@ -190,7 +190,7 @@
BJAM_SOURCES="\
command.c compile.c execunix.c expand.c fileunix.c glob.c hash.c\
hdrmacro.c headers.c jam.c jambase.c jamgram.c lists.c make.c make1.c\
- newstr.c option.c parse.c pathunix.c pathvms.c regexp.c\
+ map_file.c newstr.c option.c parse.c pathunix.c pathvms.c regexp.c\
rules.c scan.c search.c subst.c timestamp.c variable.c modules.c\
strings.c filesys.c builtins.c pwd.c class.c native.c modules/set.c\
modules/path.c modules/regex.c modules/property-set.c\
Index: build_vms.com
===================================================================
RCS file: /cvsroot/boost/boost/tools/build/jam_src/build_vms.com,v
retrieving revision 1.5
diff -a -u -r1.5 build_vms.com
--- build_vms.com 1 Jun 2004 05:42:35 -0000 1.5
+++ build_vms.com 2 Sep 2005 04:23:26 -0000
@@ -49,6 +49,7 @@
$ cc 'CC_FLAGS /OBJECT=[.bootstrap_vms]make.obj make.c
$ cc 'CC_FLAGS /OBJECT=[.bootstrap_vms]make1.obj make1.c
$ cc 'CC_FLAGS /OBJECT=[.bootstrap_vms]modules.obj modules.c
+$ cc 'CC_FLAGS /OBJECT=[.bootstrap_vms]map_file.obj map_file.c
$ cc 'CC_FLAGS /OBJECT=[.bootstrap_vms]newstr.obj newstr.c
$ cc 'CC_FLAGS /OBJECT=[.bootstrap_vms]option.obj option.c
$ cc 'CC_FLAGS /OBJECT=[.bootstrap_vms]parse.obj parse.c
Index: builtins.c
===================================================================
RCS file: /cvsroot/boost/boost/tools/build/jam_src/builtins.c,v
retrieving revision 1.43
diff -a -u -r1.43 builtins.c
--- builtins.c 4 Aug 2005 06:38:41 -0000 1.43
+++ builtins.c 2 Sep 2005 04:23:26 -0000
@@ -23,6 +23,7 @@
# include "compile.h"
# include "native.h"
# include "variable.h"
+# include "map_file.h"
# include <ctype.h>

/*
@@ -324,6 +325,12 @@
builtin_shell, 0, args );
}

+ {
+ char * args[] = { "file", 0 };
+ bind_builtin( "CAT",
+ builtin_cat, 0, args);
+ }
+
/* Initialize builtin modules */
init_set();
init_path();
@@ -1612,6 +1619,27 @@

#endif

+/*
+ * builtin_cat() - CAT ( file )
+ *
+ * Returns contents of file.
+ */
+LIST *builtin_cat( PARSE * parse, FRAME * frame )
+{
+ LIST * args = lol_get( frame->args, 0 );
+ char * file = args->string;
+
+ if ( !file_is_file( file ) )
+ return L0;
+
+ file = map_file_map( file );
+
+ if ( file )
+ return list_new( 0, file );
+ else
+ return L0;
+}
+
#ifdef HAVE_POPEN
#if defined(_MSC_VER) || defined(__BORLANDC__)
#define popen _popen
Index: builtins.h
===================================================================
RCS file: /cvsroot/boost/boost/tools/build/jam_src/builtins.h,v
retrieving revision 1.20
diff -a -u -r1.20 builtins.h
--- builtins.h 13 Jul 2005 04:26:01 -0000 1.20
+++ builtins.h 2 Sep 2005 04:23:26 -0000
@@ -47,6 +47,7 @@
LIST *builtin_check_if_file( PARSE *parse, FRAME *frame );
LIST *builtin_python_import_rule( PARSE *parse, FRAME *frame );
LIST *builtin_shell( PARSE *parse, FRAME *frame );
+LIST *builtin_cat( PARSE * parse, FRAME * frame );

void backtrace( FRAME *frame );

Index: jam.c
===================================================================
RCS file: /cvsroot/boost/boost/tools/build/jam_src/jam.c,v
retrieving revision 1.36
diff -a -u -r1.36 jam.c
--- jam.c 6 Jun 2005 12:16:26 -0000 1.36
+++ jam.c 2 Sep 2005 04:23:26 -0000
@@ -84,6 +84,7 @@
* lists.c - maintain lists of strings
* make.c - bring a target up to date, once rules are in place
* make1.c - execute command to bring targets up to date
+ * map_file.c - map (copy) files to memory
* newstr.c - string manipulation routines
* option.c - command line option processing
* parse.c - make and destroy parse trees as driven by the parser
@@ -115,6 +116,7 @@
# include "compile.h"
# include "builtins.h"
# include "rules.h"
+# include "map_file.h"
# include "newstr.h"
# include "scan.h"
# include "timestamp.h"
@@ -497,6 +499,7 @@
donerules();
donestamps();
donestr();
+ map_file_done();

/* close cmdout */

[Attachment: map_file.c]

/*
* Copyright (c) 2005 João Abecasis
*
* Distributed under the Boost Software License, Version 1.0. (See
* accompanying file LICENSE_1_0.txt or copy at
* http://www.boost.org/LICENSE_1_0.txt)
*/

/*
* map_file.c - map (copy) files to memory.
*
* A hash table of filenames is maintained to avoid mapping files multiple
* times. However, no effort is made to canonicalize filenames.
*
* No checks are done to verify that filenames correspond to regular files or
* that the contents are text data. Files are mapped into memory in their
* entirety even if they contain '\0'.
*
* Once a file is mapped on the heap, it will only be freed on map_file_done.
*
*/

#include "map_file.h"
#include "jam.h"
#include "hash.h"

#include <stddef.h> /* offsetof */
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strlen, memcpy */

/*
* struct map_file_file - represents a mapped file.
* Doubles as hashdata for hash: filename is the key.
*/
typedef struct map_file_file
{
    char * name;
    char * data;
    long size;
} map_file_file;

/*
* struct map_file_block - represents an allocated memory block.
*/
typedef struct map_file_block
{
    struct map_file_block * next;
    map_file_file file;

    char buffer_start_ /* [ strlen( file.name ) + file.size + 2 ]*/;
} map_file_block;

/*
* Statistics:
*
* map_file_count - number of mapped files (including duplicates through
* map_file_remap and map_file_unmap)
* map_file_total - number of bytes of file data (only)
*/
static long map_file_count = 0;
static long map_file_total = 0;

/*
* map_file_hash - hash of mapped files.
*/
static struct hash * map_file_hash = 0;

/*
* map_file_first_block, map_file_last_block - stack of allocated blocks.
*/
static map_file_block * map_file_first_block = 0, * map_file_last_block;

/*
* allocate() - Allocate a memory block large enough to contain filename and
* file data.
*
* Returns pointer to the block's map_file_file record, or 0 if the
* allocation fails.
*/
static map_file_file * allocate( char * filename, size_t filesize )
{
    size_t fn_len = strlen( filename ) + 1;

    map_file_block * block = (map_file_block *)malloc(
        offsetof( map_file_block, buffer_start_ ) + fn_len + filesize + 1 );

    if ( block )
    {
        /* Initialize allocated block */
        block->next = 0;

        block->file.name = &block->buffer_start_;
        block->file.data = block->file.name + fn_len;
        block->file.size = filesize;

        memcpy( block->file.name, filename, fn_len );
        block->file.data[filesize] = '\0';

        /* Statistics & Maintenance */
        ++map_file_count;
        map_file_total += filesize;

        if ( !map_file_first_block )
            map_file_first_block = block;
        else
            map_file_last_block->next = block;

        map_file_last_block = block;

        /**/
        return &block->file;
    }
    else
    {
        return 0;
    }
}

/*
* map() - maps (copies) a file to the heap.
*
* Returns the mapped file.
*/
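static map_file_file * map( char * filename )
{
    map_file_file * ret = 0;

    FILE * file = fopen( filename, "rb" );

    if ( file )
    {
        if ( !fseek( file, 0L, SEEK_END ) )
        {
            /* ftell returns -1L on failure, so check before converting. */
            long filesize = ftell( file );

            if ( filesize >= 0 )
                ret = allocate( filename, (size_t)filesize );

            if ( ret )
            {
                size_t nread;

                rewind( file );
                nread = fread( ret->data, 1, (size_t)filesize, file );

                /* On a short read, terminate after the bytes actually read. */
                if ( nread < (size_t)filesize )
                {
                    ret->data[nread] = '\0';
                    ret->size = (long)nread;
                    map_file_total -= filesize - (long)nread;
                }
            }
        }

        fclose( file );
    }

    return ret;
}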

/*
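* map_file_map - map (copy) file to memory. If file is already mapped, that
* copy is returned.
*
* Returns pointer to the in-memory copy of the file's contents, or 0 if the
* file could not be mapped.
*/
char * map_file_map( char * filename )
{
    map_file_file ff = { 0, 0, 0 }, * f = &ff;

    f->name = filename;

    if ( !map_file_hash )
        map_file_hash = hashinit( sizeof( map_file_file ), "mapped files" );

    if ( hashenter( map_file_hash, (HASHDATA **)&f ) )
    {
        /* New entry: fill it in only if the file could be mapped. */
        /* On failure the entry keeps data == 0, so 0 is returned. */
        map_file_file * mapped = map( filename );

        if ( mapped )
            memcpy( (char *)f, (char *)mapped, sizeof( map_file_file ) );
    }

    return f->data;
}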

/*
* map_file_remap - map (copy) file to memory. File will be copied to memory,
* even if it was previously mapped. Previous copies of file will not be
* deallocated.
*
* Returns pointer to the new copy of the file's contents. If the file cannot
* be read, the existing entry is left unchanged and its data (possibly 0) is
* returned.
*/
char * map_file_remap( char * filename )
{
    map_file_file ff = { 0, 0, 0 }, * f = &ff;
    map_file_file * mapped;

    f->name = filename;

    if ( !map_file_hash )
        map_file_hash = hashinit( sizeof( map_file_file ), "mapped files" );

    hashenter( map_file_hash, (HASHDATA **)&f );

    /* Guard against a failed read: map() returns 0 in that case. */
    mapped = map( filename );

    if ( mapped )
        memcpy( (char *)f, (char *)mapped, sizeof( map_file_file ) );

    return f->data;
}

/*
* map_file_unmap - discard references to filename from hash table. This
* ensures the file will be read from disk on next call to map_file_map or
* map_file_remap. Previous copies of file will not be deallocated.
*
* Returns 1 if filename was previously mapped, 0 otherwise.
*/
int map_file_unmap( char * filename )
{
    map_file_file ff = { 0, 0, 0 }, * f = &ff;

    f->name = filename;

    if ( map_file_hash )
        return hash_free( map_file_hash, (HASHDATA *)f );
    else
        return 0;
}

/*
* map_file_done() - Free allocated resources. Return static data to initial
* state.
*/
void map_file_done()
{
    map_file_block * current = map_file_first_block;

    while ( current )
    {
        map_file_block * next = current->next;
        free( current );
        current = next;
    }

    hashdone( map_file_hash );

    /* Both counters are long, so both need a long conversion specifier. */
    if ( DEBUG_MEM )
        printf( "%ldK in %ld mapped files.\n", map_file_total / 1024,
            map_file_count );

    map_file_count = 0;
    map_file_total = 0;
    map_file_hash = 0;
    map_file_first_block = 0;
    map_file_last_block = 0;
}
[Attachment: map_file.h]

/*
* Copyright (c) 2005 João Abecasis
*
* Distributed under the Boost Software License, Version 1.0. (See
* accompanying file LICENSE_1_0.txt or copy at
* http://www.boost.org/LICENSE_1_0.txt)
*/

#ifndef BOOSTJAM_MAP_FILE_H_INCLUDED
#define BOOSTJAM_MAP_FILE_H_INCLUDED

/*
* map_file - map (copy) files to memory.
*/

char * map_file_map( char * filename );
char * map_file_remap( char * filename );
int map_file_unmap( char * filename );
void map_file_done();

#endif /* include guard */


Boost-Build list run by bdawes at acm.org, david.abrahams at rcn.com, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk