[iostreams] experience with automatically decompressing with gzip or bzip2

Hello, I've recently used iostreams for easier gzipping and bzip2ing of output streams. However, I'd also like to be able to read any file compressed with either gzip or bzip2, by analyzing the input stream and deducing which decompressor to use. I don't want to open the file twice or to seek in the stream, as this might not be possible when e.g. reading from the network.

My solution is to create a custom streambuf which is told to read and retain the first few bytes, in order to determine which decompressor to push onto the boost::iostreams::filtering_streambuf. Then the actual reading can take place: the custom streambuf first returns the retained bytes, then just streams the rest of the underlying input streambuf. As far as I can tell, this works nicely. I wonder if anybody else has better solutions for this? Perhaps there is some capability of iostreams that I've overlooked?

Below is the code; feel free to use it as you wish. I don't know whether I need to override xsgetc(), underflow() and uflow(), so they are currently just non-working stubs. So far nothing seems to have triggered any of them: only xsgetn() seems to be used. I haven't found any actual errors, though; both gzip and bzip2 files seem to decompress. Otherwise the code can be improved a lot, e.g. by registering separate compression detectors instead of hard-coding them in the streambuf. I'm also a bit lost about where and when to use streambufs instead of streams, e.g. for m_input. Apologies if the code is too long for your mailing list; I didn't find any guidelines for this.
Regards, Marcus

#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/bzip2.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <iostream>
#include <boost/noncopyable.hpp>

class general_decompressor_streambuf
    : public std::basic_streambuf<char, std::char_traits<char> >,
      public boost::noncopyable
{
private:
    std::streambuf& m_input;
    static const int BUFFER_SIZE = 5;
    unsigned char m_read[BUFFER_SIZE];
    std::streamsize m_readpos;
    std::string m_compression_type;

public:
    general_decompressor_streambuf(std::streambuf& i)
        : m_input(i), m_readpos(0)
    { }

    ~general_decompressor_streambuf() throw () { }

    std::string get_compression_type() const { return m_compression_type; }

    void resolve_compressor(
        boost::iostreams::filtering_streambuf<boost::iostreams::input>& sb)
    {
        int pos = 0;
        while (pos < BUFFER_SIZE) {
            int c = m_input.sbumpc();
            if (c == EOF)
                return;
            m_read[pos++] = c;
        }
        // gzip magic: 0x1f 0x8b (octal 037 0213); bzip2 streams open with "BZh"
        if (m_read[0] == 037 && m_read[1] == 0213) {
            m_compression_type = "GZIP";
            sb.push(boost::iostreams::gzip_decompressor());
        } else if (m_read[0] == 'B' && m_read[1] == 'Z' && m_read[2] == 'h') {
            m_compression_type = "BZIP2";
            sb.push(boost::iostreams::bzip2_decompressor());
        }
    }

    std::streamsize xsgetn(char* s, std::streamsize n)
    {
        std::streamsize cnt = 0;
        // Replay the retained magic bytes first. (Fixed: the loop tests
        // cnt < n; the originally posted code tested n > 0 but never
        // decremented n.)
        while (m_readpos < BUFFER_SIZE && cnt < n)
            s[cnt++] = m_read[m_readpos++];
        if (cnt == n)
            return cnt;
        return cnt + m_input.sgetn(s + cnt, n - cnt);
    }

    // Note: std::streambuf has no virtual xsgetc(), so this member is
    // never called. underflow()/uflow() below are still just logging stubs.
    int xsgetc()
    {
        std::cerr << "xsgetc" << std::endl;
        return m_input.sgetc();
    }

    int underflow()
    {
        std::cerr << "underflow" << std::endl;
        return m_input.sgetc();
    }

    int uflow()
    {
        std::cerr << "uflow" << std::endl;
        return m_input.sgetc();
    }
};

int main()
{
    general_decompressor_streambuf buffering_in_streambuf(*std::cin.rdbuf());
    boost::iostreams::filtering_streambuf<boost::iostreams::input> cmpr;
    buffering_in_streambuf.resolve_compressor(cmpr);
    cmpr.push(buffering_in_streambuf);
    std::istream i(&cmpr);
    boost::iostreams::copy(i, std::cout);
    return 0;
}
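[Editor's note: for readers without the Boost headers at hand, the detection step by itself reduces to a magic-number check. A minimal standard-C++ sketch of the same logic as resolve_compressor() above (the function name is invented here):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Classify a buffer of leading bytes by its magic number.
// gzip streams begin with 0x1f 0x8b; bzip2 streams begin with "BZh".
// Anything else falls through as "UNKNOWN".
std::string detect_compression(const unsigned char* buf, std::size_t len)
{
    if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
        return "GZIP";
    if (len >= 3 && buf[0] == 'B' && buf[1] == 'Z' && buf[2] == 'h')
        return "BZIP2";
    return "UNKNOWN";
}
```

Only two bytes are needed for gzip and three for bzip2, so the five-byte BUFFER_SIZE in the post is more than detection strictly requires.]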

Marcus Alanen wrote:
Hello, I've recently used iostreams for easier gzipping and bzip2ing of output streams. However, I'd also like to be able to read any file compressed with either gzip or bzip2, by analyzing the input stream and deducing which decompressor to use.
I don't want to open the file twice or to seek in the stream, as this might not be possible due to e.g. reading from the network. My solution is to create a custom streambuf, which is told to read and retain the first few bytes to determine which decompressor to push onto the boost::iostreams::filtering_streambuf. Then the actual reading can take place, and the custom streambuf first returns the retained bytes, then just streams the rest of the actual input streambuf.
This is a nice idea. IMO, the best way to implement it would be as a filter -- perhaps you could call it a stream_signature_filter. You might use it as follows:

    stream_signature_filter f;
    f.push("GZIP", gzip_decompressor());
    f.push("BZh", bzip2_decompressor());
    filtering_istreambuf in(f);
    in.push(file("archive.tar.gz"));

You could then define filters derived from stream_signature_filter that have preset mappings from signatures to filters. I'll add this to my list of ideas for 1.34. Thanks!
Regards, Marcus
Jonathan
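[Editor's note: stream_signature_filter was only a proposal at this point, so the snippet above is hypothetical API, not shipped Boost code. Its core would be a first-match lookup over registered signatures. A standard-C++ sketch, with filter names standing in for actual filter objects (all names invented here):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical signature table: (signature, filter name) pairs searched
// in registration order; the first signature that is a prefix of the
// stream's leading bytes wins.
class signature_table {
public:
    void push(std::string signature, std::string filter_name) {
        entries_.push_back(std::make_pair(std::move(signature),
                                          std::move(filter_name)));
    }
    // Return the filter registered for the first matching signature,
    // or "" if no signature matches the sniffed head of the stream.
    std::string resolve(const std::string& head) const {
        for (const auto& e : entries_)
            if (head.compare(0, e.first.size(), e.first) == 0)
                return e.second;
        return "";
    }
private:
    std::vector<std::pair<std::string, std::string> > entries_;
};
```

Registration order matters: more specific signatures should be pushed before any catch-all entry.]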

Jonathan Turkanis wrote:
This is a nice idea. IMO, the best way to implement it would be as a filter -- perhaps you could call it a stream_signature_filter. You might use it as follows:
    stream_signature_filter f;
    f.push("GZIP", gzip_decompressor());
    f.push("BZh", bzip2_decompressor());
    filtering_istreambuf in(f);
    in.push(file("archive.tar.gz"));
You could then define filters derived from stream_signature_filter that have preset mappings from signatures to filters.
This sounds better, but was beyond my Boost knowledge. (For the mailing-list archive: I noticed my xsgetn() didn't decrement the n variable correctly.) Then again, perhaps the stream_signature_filter should just try each decompressor in turn, and whichever does not throw an exception should be allowed to continue. Please allow it to stream through unknown compression schemes, especially uncompressed files :-)
I'll add this to my list of ideas for 1.34.
Excellent, thank you! Marcus
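[Editor's note: the xsgetn() bug Marcus mentions can be shown in isolation. Below is a standard-C++ sketch of a prefix-replaying streambuf, with the loop condition corrected to test the count delivered so far (cnt < n) rather than n > 0, and with working underflow()/uflow() in place of the stubs. The class name is invented here; it is not part of the posted code or of Boost:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// A streambuf that first replays a retained prefix (e.g. sniffed magic
// bytes), then forwards all further reads to the underlying streambuf.
class prefix_replay_buf : public std::streambuf {
public:
    prefix_replay_buf(std::streambuf& src, std::string prefix)
        : src_(src), prefix_(std::move(prefix)), pos_(0) {}
protected:
    std::streamsize xsgetn(char* s, std::streamsize n) override {
        std::streamsize cnt = 0;
        // Fixed loop: test cnt < n (the original tested n > 0 but
        // never decremented n).
        while (pos_ < prefix_.size() && cnt < n)
            s[cnt++] = prefix_[pos_++];
        if (cnt < n)
            cnt += src_.sgetn(s + cnt, n - cnt);
        return cnt;
    }
    int_type underflow() override {   // peek without consuming
        if (pos_ < prefix_.size())
            return traits_type::to_int_type(prefix_[pos_]);
        return src_.sgetc();
    }
    int_type uflow() override {       // read and consume one character
        if (pos_ < prefix_.size())
            return traits_type::to_int_type(prefix_[pos_++]);
        return src_.sbumpc();
    }
private:
    std::streambuf& src_;
    std::string prefix_;
    std::string::size_type pos_;
};
```

With underflow() and uflow() implemented this way, single-character reads also see the retained prefix, which the logging stubs in the original post did not guarantee.]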

Marcus Alanen wrote:
Jonathan Turkanis wrote:
This is a nice idea. IMO, the best way to implement it would be as a filter -- perhaps you could call it a stream_signature_filter. You might use it as follows:
    stream_signature_filter f;
    f.push("GZIP", gzip_decompressor());
    f.push("BZh", bzip2_decompressor());
    filtering_istreambuf in(f);
    in.push(file("archive.tar.gz"));
You could then define filters derived from stream_signature_filter that have preset mappings from signatures to filters.
Then again, perhaps the stream_signature_filter should just try out each decompressor in turn, and whichever does not throw an exception should be allowed to continue.
This doesn't generalize well to non-compression filters. Many filters can handle any stream of data without throwing an exception, even if it's not what the user expects.
Please allow it to stream through unknown compression schemes, especially uncompressed files :-)
The way I'd handle this would be to allow signatures to contain wildcard characters, which is necessary anyway for some file formats. Then you could write:

    stream_signature_filter f;
    f.push("GZIP", gzip_decompressor());
    f.push("BZh", bzip2_decompressor());
    f.push("?", identity_filter());  // Wildcard
    filtering_istreambuf in(f);
    in.push(file("archive.tar.gz"));

This reminds me: despite the algebraic flavor of some of the existing components (null_source, inverse, ...), I never implemented an identity filter.
I'll add this to my list of ideas for 1.34.
Excellent, thank you!
Marcus
Jonathan
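[Editor's note: the wildcard idea can be sketched without any Boost machinery. A minimal matcher, assuming that '?' matches any single byte, so the catch-all entry "?" matches every non-empty stream. The wildcard character and its semantics are from Jonathan's proposal above, not an existing Boost API:

```cpp
#include <cassert>
#include <string>

// Return true if the pattern matches the leading bytes of the stream.
// '?' in the pattern matches any single byte; every other byte must
// match exactly.
bool signature_matches(const std::string& pattern, const std::string& head)
{
    if (head.size() < pattern.size())
        return false;  // not enough bytes sniffed to decide
    for (std::string::size_type i = 0; i < pattern.size(); ++i)
        if (pattern[i] != '?' && pattern[i] != head[i])
            return false;
    return true;
}
```

Checking registered patterns in order with this function and falling back to the "?" entry gives exactly the pass-through behavior Marcus asked for on uncompressed input.]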
participants (2)
- Jonathan Turkanis
- Marcus Alanen