Dear list,

I have been using boost.iostreams to read gzipped files for some time, but recently have run into some problems with files that are compressed using tools other than gzip (namely bgzip from http://samtools.svn.sourceforge.net/viewvc/samtools/trunk/samtools/).

Here is the file:

$ hexdump -C ../hello.txt.bgz 
00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  35 00 f3 48 cd c9 c9 d7  51 28 c9 c8 2c 56 00 a2  |5..H....Q(..,V..|
00000020  44 85 92 d4 e2 12 85 b4  cc 9c 54 3d 2e 00 86 1e  |D.........T=....|
00000030  ef a4 1c 00 00 00 1f 8b  08 04 00 00 00 00 00 ff  |................|
00000040  06 00 42 43 02 00 1b 00  03 00 00 00 00 00 00 00  |..BC............|
00000050  00 00                                             |..|
00000052

I think this is valid gzip format (based on reading http://www.gzip.org/zlib/rfc-gzip.html).  It differs from what you get by compressing the file with the gzip program in two respects: 1. it sets the 'extra field' flag (FLG.FEXTRA) to add extra data to each block header, and 2. it has two blocks (gzip "members" in the RFC's terms), the second of which is empty.
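For what it's worth, the layout described above can be checked without inflating anything.  The following is only a minimal sketch, not code from Boost or samtools: it assumes every member carries a 'BC' extra subfield whose 16-bit payload is the total member size minus one (which is what the bytes in the hexdump suggest), and the names walk_bgzf and Member are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

typedef std::vector<unsigned char> Bytes ;

// One gzip member as located by its 'BC' extra subfield (hypothetical helper type).
struct Member {
    std::size_t offset ;   // where the member starts in the buffer
    std::size_t size ;     // total size, header and trailer included
    std::uint32_t crc32 ;  // CRC32 stored in the trailer
    std::uint32_t isize ;  // uncompressed size stored in the trailer
} ;

// Little-endian readers; the caller is responsible for bounds.
static std::uint32_t le16( Bytes const& b, std::size_t i ) {
    assert( i + 1 < b.size() ) ;
    return b[i] | ( std::uint32_t( b[i+1] ) << 8 ) ;
}
static std::uint32_t le32( Bytes const& b, std::size_t i ) {
    assert( i + 3 < b.size() ) ;
    return le16( b, i ) | ( le16( b, i + 2 ) << 16 ) ;
}

// Walk the members of a buffer, using the assumed 'BC' subfield to find each boundary.
std::vector< Member > walk_bgzf( Bytes const& b ) {
    std::vector< Member > result ;
    std::size_t pos = 0 ;
    while( pos < b.size() ) {
        if( b.size() - pos < 12 || b[pos] != 0x1f || b[pos+1] != 0x8b || b[pos+2] != 8 )
            throw std::runtime_error( "not a gzip member" ) ;
        if( !( b[pos+3] & 0x04 ) )  // FLG.FEXTRA
            throw std::runtime_error( "FEXTRA not set" ) ;
        std::size_t const xlen = le16( b, pos + 10 ) ;
        std::size_t sub = pos + 12 ;
        std::size_t const end = sub + xlen ;
        if( end > b.size() )
            throw std::runtime_error( "truncated extra field" ) ;
        std::size_t size = 0 ;
        while( sub + 4 <= end ) {  // each subfield: SI1, SI2, 16-bit length, payload
            std::size_t const slen = le16( b, sub + 2 ) ;
            if( b[sub] == 'B' && b[sub+1] == 'C' && slen == 2 && sub + 6 <= end )
                size = le16( b, sub + 4 ) + 1u ;
            sub += 4 + slen ;
        }
        // A member needs room for the header, the extra field and the 8-byte trailer.
        if( size < 12 + xlen + 8 || size > b.size() - pos )
            throw std::runtime_error( "missing or bad BC subfield" ) ;
        Member m = { pos, size, le32( b, pos + size - 8 ), le32( b, pos + size - 4 ) } ;
        result.push_back( m ) ;
        pos += size ;
    }
    return result ;
}
```

Fed the 82 bytes of hello.txt.bgz shown above, walk_bgzf should report two members, of 54 and 28 bytes, with stored uncompressed sizes of 28 and 0 and a stored CRC32 of 0 for the second (empty) one.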

My file reading code is as follows:

/* test.cpp */
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/file.hpp>

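int main( int argc, char** argv ) {
    if( argc < 2 ) {
        std::cerr << "usage: " << argv[0] << " <file.gz>\n" ;
        return 1 ;
    }
    // Decompress the file named on the command line and copy it to stdout.
    boost::iostreams::filtering_istream stream ;
    stream.push( boost::iostreams::gzip_decompressor() ) ;
    boost::iostreams::file_source file( argv[1] ) ;
    stream.push( file ) ;
    boost::iostreams::copy( stream, std::cout ) ;
    return 0 ;
}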


I am using boost 1.48.0 on Ubuntu linux.  I compile with:

$ g++ -o test -g test.cpp -L /home/gav/Projects/Software/usr/lib -lboost_iostreams-gcc46-mt-1_48 -I /home/gav/Projects/Software/boost_1_48_0

Using gunzip on the above file works fine:
$ gunzip -c ../hello.txt.bgz 
Hello, this is a test file.

Using the test program does not:
$ ./test  ../hello.txt.bgz 
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::iostreams::gzip_error> >'
  what():  gzip error

I was able to fix this problem with two code changes in the boost.iostreams implementation, as follows.


Change #1: at libs/iostreams/src/gzip.cpp:65: change "state_ = s_extra;" to "state_ = s_xlen;"
Rationale: without this change, it looks like state s_xlen is never entered so the length of the extra data is not parsed.  Update: I notice this change was already raised in ticket #5908, and has already been fixed in trunk.


Change #2: at libs/iostreams/src/zlib.cpp:153: change "crc_imp_ = 0;" to "crc_ = crc_imp_ = 0;"
Rationale: without this change, an empty block of data does not re-initialise the member variable crc_ of zlib_base.  This causes an exception on line 447 of boost/iostreams/filter/gzip.hpp.
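To illustrate the rationale for change #2, here is a toy model of the failure.  It is not Boost's code: ToyChecksum and members_verify are hypothetical names, and the model simply assumes that the reported checksum (standing in for crc_) is refreshed only when a member produces output, while the running checksum (standing in for crc_imp_) is always reset between members.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320, the checksum gzip uses.
// Pass 0 as the initial value; the result can be fed back in to continue.
std::uint32_t crc32_update( std::uint32_t crc, std::string const& data ) {
    crc = ~crc ;
    for( std::size_t i = 0; i < data.size(); ++i ) {
        crc ^= static_cast< unsigned char >( data[i] ) ;
        for( int k = 0; k < 8; ++k )
            crc = ( crc >> 1 ) ^ ( 0xEDB88320u & ( 0u - ( crc & 1u ) ) ) ;
    }
    return ~crc ;
}

// Sanity check against the standard CRC-32 check value.
inline void self_check() {
    assert( crc32_update( 0, "123456789" ) == 0xCBF43926u ) ;
}

// Toy stand-in for the checksum state (assumption: not Boost's actual class).
struct ToyChecksum {
    std::uint32_t running ;   // plays the role of crc_imp_
    std::uint32_t reported ;  // plays the role of crc_
    bool reset_reported ;     // true models the proposed change #2
    explicit ToyChecksum( bool fix ) : running( 0 ), reported( 0 ), reset_reported( fix ) {}
    void consume( std::string const& chunk ) {
        if( chunk.empty() ) return ;  // no output, so nothing is refreshed
        running = crc32_update( running, chunk ) ;
        reported = running ;
    }
    void reset() {
        running = 0 ;
        if( reset_reported ) reported = 0 ;
    }
} ;

// Decode a sequence of member payloads, comparing the reported checksum with the
// value each member's trailer would hold.  Returns false on the first mismatch.
bool members_verify( bool fix, std::vector< std::string > const& payloads ) {
    ToyChecksum c( fix ) ;
    for( std::size_t i = 0; i < payloads.size(); ++i ) {
        c.consume( payloads[i] ) ;
        if( c.reported != crc32_update( 0, payloads[i] ) ) return false ;
        c.reset() ;
    }
    return true ;
}
```

In this model a non-empty member followed by an empty one fails verification without the extra reset, because the empty member's trailer holds a CRC of 0 while the reported value is still that of the previous member.  Two non-empty members verify either way, which would be consistent with the problem only showing up for empty blocks.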

I confirmed this behaviour a second way, by concatenating two files compressed using gzip, the first nonempty and the second empty.  This creates a file which looks like this:

$ hexdump -C hello2.txt.gz 
00000000  1f 8b 08 00 0d ea b4 4f  00 03 cb 48 cd c9 c9 57  |.......O...H...W|
00000010  28 c9 48 2d 4a e5 02 00  8e 45 d1 59 0c 00 00 00  |(.H-J....E.Y....|
00000020  1f 8b 08 00 18 ea b4 4f  00 03 03 00 00 00 00 00  |.......O........|
00000030  00 00 00 00                                       |....|
00000034

This file again decompresses with gunzip but not with the program above.  Change #2 above fixes this.



Could someone take a look at this and let me know if this change is really appropriate?  (Or perhaps I'm doing something wrong.)  I can create a bug report if desired.


Many thanks,
Gavin Band.